## Assignment 11 (last): Speedup games, slow computing

1. Speedup games. A speaker at a conference presents performance data for her highly scalable code on up to 1000 cores of a recent multicore-based supercomputer platform (see the graph on the right). She says "We achieve a parallel efficiency of about 0.5 on 1,000 cores, so there must be hidden optimization potential that we will investigate."

What is the natural question to ask?

2. Computing faster with slower CPUs? The BlueGene (BG) supercomputers by IBM have held top positions in the Top500 list for several years. The philosophy behind their design is to use a reasonable network and a rather slow, but low-power processor to get a balanced system (in terms of processor speed vs. network bandwidth). It is clear that, compared to a parallel computer with the same network but faster CPU, a BlueGene must always run slower on the same problem with the same number of CPUs. So why are they doing that and why does it seem to work so well?

Develop a refined performance model for BlueGene-type systems by starting from Amdahl's Law with our simple communication model (overhead = kN). Assume that a BG processor is µ times slower than a standard processor (μ≥1) and that everything else stays the same (ratio s/p, network characteristics,...). You may use computer algebra software to solve the following tasks although it is entirely possible to do it by hand.

• What is the maximum possible speedup that can be achieved in relation to a single BG processor? To answer this question you have to formulate a mathematical expression that relates the execution time of the parallel code on N BlueGene processors to the execution time of the serial code on one BG CPU. Assume that s+p=1 on a standard CPU with µ=1, i.e. for the serial run on a BG CPU the execution time is µ(s+p). If you then form the first derivative w.r.t. N, set it to zero and solve for N you get the "sweet spot", i.e. the number of CPUs where speedup is best.This number depends on s, k and µ. You can take it from there on your own.
• Compare with the situation at µ=1 (fast "standard" processors). Now you see why BlueGene is said to show "good scaling".
• What we actually want to see is a comparison of performance, not scalability. How does maximum performance on the BG relate to maximum performance on a parallel computer with standard processors? Does BlueGene have an advantage here?
• The "maximum performance" spot is not where you really want to operate big machines, because parallel efficiency is usually quite low there. It is better to look at smaller CPU numbers where performance is still roughly linear in N. In this limit, if we compare µN BlueGene processors with N standard processors, is there an advantage for BG? Motivate for which kinds of communication overhead models BlueGene is expected to show better performance at µN processors than the standard system at N processors, i.e., how should communication overhead scale with N?
Hint: Use a general function c(N) instead of kN for the overhead and derive a condition that c(N) must fulfill so that µN slow processors have an advantage against N standard processors.
• Does 3D Cartesian Domain Decomposition with next-neighbor halo exchange fulfill the requirements of a "favorable" communication function c(N) in the above context? What about 2D and 1D decomposition ("poles" and "slabs")?
3. Oversubscribed fat tree. Construct the largest possible fat-tree network with one leaf and one spine level out of 36-port QDR-InifinBand switches and 2:1 oversubscription. How many compute nodes can be connected to this hierarchy, and how many 36-port switches are needed? Would you be able to build a switch hierarchy with a 3:2 oversubscription? What would be the main problem with this?