A speaker at a conference presents performance data for her highly
scalable code on up to 1000 cores of a recent multicore-based
supercomputer platform (see the graph on the right). She says "We
achieve a parallel efficiency of about 0.5 on 1,000 cores, so there must
be hidden optimization potential that we will investigate."
What is the natural question to ask?
Computing faster with slower CPUs?
The BlueGene (BG)
supercomputers by IBM have held top positions in the Top500 list for
several years. The philosophy behind their design is to use a reasonable
network and a rather slow, but low-power processor to get a balanced
system (in terms of processor speed vs. network bandwidth). It is clear
that, compared to a parallel computer with the same network but faster
CPU, a BlueGene must always run slower on the same problem with the same
number of CPUs. So why are they doing that, and why does it seem to work?
Develop a refined performance model for BlueGene-type
systems by starting from Amdahl's Law with our simple communication
model (overhead = kN). Assume that a BG processor is µ times slower than
a standard processor (µ ≥ 1) and that everything else stays the same
(ratio s/p, network characteristics, ...). You may use computer algebra
software to solve the following tasks, although it is entirely possible
to do it by hand.
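As one possible starting point (a sketch only, to be checked against your own formulation in the tasks below; we assume that the overhead kN is not affected by the slower CPU), the walltime on N BG processors could be written as

    T_BG(N) = µ(s + p/N) + kN,        with serial baseline T_BG(1) = µ(s + p)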
What is the maximum possible speedup
that can be achieved in relation to a single BG processor? To answer
this question you have to formulate a mathematical expression that
relates the execution time of the parallel code on N BlueGene processors
to the execution time of the serial code on one BG CPU. Assume that
s+p=1 on a standard CPU with µ=1, i.e. for the serial run on a BG CPU
the execution time is µ(s+p). If you then form the first derivative
w.r.t. N, set it to zero, and solve for N, you get the "sweet spot", i.e.,
the number of CPUs where the speedup is best. This number depends on s, k, and
µ. You can take it from there on your own.
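If you do use computer algebra, a small SymPy sketch along the following lines may help; the symbol names are ours, and p = 1 - s is used as stated above. Verify the result by hand:

    import sympy as sp

    N, s, k, mu = sp.symbols('N s k mu', positive=True)

    # Walltime on N BlueGene CPUs under the model above (with p = 1 - s):
    T_par = mu*(s + (1 - s)/N) + k*N
    # Speedup relative to one BG CPU, whose serial time is mu*(s + p) = mu:
    S = mu / T_par

    # Sweet spot: first derivative w.r.t. N set to zero
    N_opt = sp.solve(sp.diff(S, N), N)[0]
    print(N_opt)                              # sqrt(mu*(1 - s)/k)
    print(sp.simplify(S.subs(N, N_opt)))      # maximum speedup at the sweet spot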
Compare with the situation at µ=1 (fast "standard" processors). Now you see why BlueGene is said to show "good scaling".
What we actually want to see is a comparison of performance, not scalability.
How does maximum performance on the BG relate to maximum performance on
a parallel computer with standard processors? Does BlueGene have an
advantage?
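One convenient convention (ours, not prescribed by the exercise) is to normalize the total serial work to one unit, so that performance on N processors is simply P(N) = 1/T(N); "maximum performance" then means minimum walltime, attained at the respective sweet spots of the two machines.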
The "maximum performance" spot is not where you
really want to operate big machines, because parallel efficiency is
usually quite low there. It is better to look at smaller CPU numbers
where performance is still roughly linear in N. In this limit, if we
compare µN BlueGene processors with N standard processors, is there an
advantage for BG? Explain for which kinds of communication overhead
models BlueGene is expected to show better performance at µN processors
than the standard system at N processors, i.e., how should the communication
overhead scale with N? Hint: Use a general function
c(N) instead of kN for the overhead and derive a condition that c(N)
must fulfill so that µN slow processors have an advantage over N
standard processors.
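One possible way to set up this comparison (a sketch under the same model assumptions as above; note that each of the µN BG processes then handles only a share p/(µN) of the parallel work):

    T_std(N)  = s + p/N + c(N)
    T_BG(µN)  = µ(s + p/(µN)) + c(µN) = µs + p/N + c(µN)

    BG advantage:   µs + c(µN) < s + c(N),   i.e.   c(N) - c(µN) > (µ - 1)s

From this inequality you can read off how c(N) would have to behave with growing N.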
Does 3D Cartesian domain decomposition with next-neighbor halo exchange fulfill the requirements of a "favorable" communication function c(N) in the above context? What about 2D and 1D decompositions ("poles" and "slabs")?
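As a reminder (our notation, with L the edge length of the overall cubic domain, communication time taken proportional to the halo area, and latency ignored), the halo data volume per process in strong scaling behaves roughly like:

    1D ("slabs"): halo area per subdomain ~ L^2             =>  c(N) ~ constant
    2D ("poles"): halo area per subdomain ~ L * L/N^(1/2)   =>  c(N) ~ N^(-1/2)
    3D ("cubes"): halo area per subdomain ~ (L/N^(1/3))^2   =>  c(N) ~ N^(-2/3)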
Oversubscribed fat tree.
Construct the largest possible fat-tree network with one leaf and one
spine level out of 36-port QDR InfiniBand switches and 2:1
oversubscription. How many compute nodes can be connected to this
hierarchy, and how many 36-port switches are needed? Would you be able
to build a switch hierarchy with a 3:2 oversubscription? What would be
the main problem with this?
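A small Python sketch of the port bookkeeping may help to get started; the wiring assumption (exactly one uplink from every leaf switch to every spine switch) and all names are ours, so check whether a different wiring could do better:

    def fat_tree_nodes(ports=36, down=24, up=12):
        """Two-level fat tree: each leaf switch uses `down` ports for compute
        nodes and `up` ports as uplinks, one uplink per spine switch."""
        assert down + up == ports
        spines = up                 # one spine switch per uplink of a leaf
        leaves = ports              # each spine switch can accept `ports` leaf links
        nodes = leaves * down       # compute nodes attached to the leaf level
        return nodes, leaves + spines

    print(fat_tree_nodes(36, 24, 12))   # -> (864, 48) with the 2:1 split assumed here
    # For 3:2 oversubscription a leaf switch would need 36 * 3/5 = 21.6 downlinks
    # and 36 * 2/5 = 14.4 uplinks, which is not an integer split of a 36-port switch.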