## Assignment 8: Slow computing, speedup games, networks

*Computing faster with slower CPUs?***(50 credits)**The BlueGene (BG) supercomputers by IBM held top positions in the Top500 list for several years. The philosophy behind their design is to use a reasonable network and a rather slow, but low-power processor to get a balanced system (in terms of processor speed vs. network bandwidth). It is clear that, compared to a parallel computer with the same network but faster CPU, a BlueGene must always run slower on the same problem with the same number of CPUs. So why are they doing that and why does it seem to work so well?

Develop a refined performance model for BlueGene-type systems by starting from Amdahl's Law with our simple communication model (overhead = \( kN \)). Assume that a BG processor is \( \mu \) times slower than a standard processor (\( \mu\geq 1 \)) and that everything else stays the same (ratio \( s/p \), network characteristics,...). You may use computer algebra software to solve the following tasks although it is entirely possible to do it by hand.

- What is the maximum possible speedup that can be achieved in relation to a single BG processor? To answer this question you have to formulate a mathematical expression that relates the execution time of the parallel code on \( N \) BlueGene processors to the execution time of the serial code on one BG CPU. Assume that \( s+p=1 \) on a standard CPU with \( \mu=1 \), i.e. for the serial run on a BG CPU the execution time is \( \mu(s+p) \). If you then form the first derivative w.r.t. \( N \), set it to zero, and solve for \( N \), you get the "sweet spot", i.e. the number of CPUs where speedup is best.This number depends on \( s \), \( k \), and \( \mu \). You can take it from there on your own.
- Compare with the situation at \( \mu=1 \) (fast "standard" processors). Now you see why BlueGene is said to show "good scaling".
- What we actually want to see is a comparison of performance, not scalability. How does maximum performance on the BG relate to maximum performance on a parallel computer with standard processors? Does BlueGene have an advantage here?
- The "maximum performance" spot is not where you really want to operate big machines, because parallel efficiency is usually quite low there. It is better to look at smaller CPU numbers where performance is still roughly linear in \( N \). In this
limit, if we compare \( \mu N \) BlueGene processors with \( N \) standard processors, is there an advantage for BG? Motivate for which kinds of communication overhead models BlueGene is expected to show better performance at \( \mu N \) processors
than the standard system at \( N \) processors, i.e., how should communication overhead scale with \( N \)?

Hint: Use a general function \( c(N) \) instead of \( kN \) for the overhead and derive a condition that \( c(N) \) must fulfill so that \( \mu N \) slow processors have an advantage against \( N \) standard processors. - Show that 3D Cartesian Domain Decomposition with next-neighbor halo exchange fulfills the requirements of a "favorable" communication function \( c(N) \) in the above context! What about 2D and 1D decomposition ("poles" and "slabs")?

*Oversubscribed fat tree*.**(10 credits)**Construct the largest possible fat-tree network with one leaf and one spine level out of 36-port QDR-InifinBand switches and 2:1 oversubscription. How many compute nodes can be connected to this hierarchy, and how many 36-port switches are needed?*Halo exchange*.**(40 credits)**Assume you have a parallel computer with a memory bandwidth of 200 GB/sec per node, an inter-node communication latency of 2 \( \mu s \), an asymptotic inter-node communication bandwidth of 12 Gbyte/s (full-duplex, per direction), and an (assumed) fully nonblocking fat-tree network. On this system we execute a double-precision 3D Jacobi smoother (as discussed in the lecture) with halo exchange. The per-node subdomain has \( N_d^3 \) elements. Assume that "perfect" spatial blocking is in place, i.e., the minimum code balance applies, and that intra-node parallelization is overhead free. You can also assume that the entire communication overhead is in the network communication (no time needed for buffer copying etc.). One halo layer must be communicated per subdomain face and direction (cubic domain decomposition).

(a) Calculate the maximum per-node memory-bound performance in LUP/s.

(b) Calculate the inter-node communication message length \( n_{1/2} \) in bytes at which half of the asymptotic effective communication bandwidth is attained.

(c) Calculate the linear subdomain size \( N_d \) where the communication to neighboring nodes becomes bandwidth bound. You may use the threshold from (b) as the criterion for "bandwidth bound."

(d) Calculate the linear subdomain size \( N_d \) where the update sweep takes the same amount of time as the communication of the halos to/from the six neighboring nodes. What do you conclude from the result? Is this "performance model" correct?