## Assignment 4: Clock speed, peak performance, power (due June 11)

1. Multicore power envelope. (20 credits) We wish to understand why processor vendors trade clock frequency for concurrency when introducing multicore chips, so we establish a simple model for power dissipation. Assume that a certain chip (with a given manufacturing process, e.g., 28 nm) dissipates a power of W = Wd when running at its base frequency f0. Furthermore, assume that W is proportional to f³.

(a) "Overclocking" is a popular strategy (especially among gamers) to speed up the CPU. Calculate the power dissipation W when the clock frequency is increased by 30% (e.g., from 2.0 GHz to 2.6 GHz).
(b) If we require that the overall power dissipation of a chip shall be constant (Wd), we can trade clock speed for cores: By reducing the clock speed by some relative amount Δf/f0, we can put more cores (transistors) on the same chip and still stay within the power limit. Calculate the required clock speed reduction Δf/f0 for an m-core chip with a power dissipation of Wd. Calculate how many cores can be put on the chip when the clock speed is reduced by a factor of two.
(c) Now assume m=8. Calculate the minimum and maximum performance gain (compared to the single-core chip with m times fewer transistors but the same power envelope, and the same memory interface) when using all 8 cores. For which types of code do you expect these extremal performance gains?

2. Turbo mode. (20 credits) Use your "π by integration" code from Assignment 0 and parallelize it with OpenMP. Make sure that it still calculates an approximation to π, even when using multiple threads.

(a) Run your code with N=10⁹ slices on 1,...,10 cores of one socket of Emmy. Do not fix the clock speed of the cores but let it run in Turbo Mode (the default). Then run it again with a fixed clock speed of 2.2 GHz. Present the performance in loop iterations per second versus the number of cores in one diagram for both cases. Comment on the results. Are they as expected?

Note that this code does not have any chip-level bottleneck; all needed execution resources are strictly in-core. Note also that it takes some time for Turbo Mode to kick in on the cores. If the program runs only for a short time, you may not see the full impact of Turbo. It is thus a good idea to run the main loop twice: once for "warmup" and once "for real."

(b) Calculate the clock speed of the cores from the performance measurements. Draw a second graph that shows the clock frequency of the cores versus the number of active cores. You may assume that all active cores have the same clock frequency.

If you enforce the thread-core affinity with likwid-pin, be aware that this causes some overhead for the first parallel region encountered in the program (this is where the threads get created and pinned). In order to avoid such effects, execute a "bogus" parallel region before the actual benchmark loop.

3. Peak performance. (10 credits)
(a) Calculate the arithmetic single-precision floating-point peak performance of an Intel Xeon Phi "Knights Corner" chip with
• 60 cores
• a clock speed of 1.05 GHz
• the AVX-512 instruction set (512-bit registers, one full-width FMA instruction per cycle).

(b) Calculate the arithmetic single-precision peak performance of an Nvidia "Volta" chip with
• 80 SM units (streaming multiprocessors), each with 64 SP FMA units
• a clock speed of 1.4 GHz

4. Optimal energy to solution. (50 credits) We want to model the power consumption of an 8-core processor under a certain workload, and make the following assumptions:

• The chip dissipates a "baseline power" W0 when all cores are idle, independent of the clock frequency.
• If a core is executing instructions, it dissipates the additional ("dynamic") power Wd, which adds to the chip's baseline power.
• The performance of a certain code running on a single core of this processor is P(1).
• The maximum performance of the parallel version of this code is Plimit > P(1). Hence, when solving a given problem in parallel with n cores, the overall performance is min(n·P(1), Plimit). This models a behavior where the code is limited by a bottleneck when running on multiple cores, i.e., performance "saturates" at some point.

We want to solve a given problem with n = 1,...,8 cores and calculate the energy E(n) it takes to solve it. This is our "energy to solution" metric. You can assume that the amount of work to be done is normalized to 1, i.e., the time to solution on n cores is T(n) = 1/P(n). For the following tasks you may find it helpful to use a spreadsheet program:

(a) Calculate E(n) for W0 = 20 W, Wd = 12 W, P(1) = 1 s⁻¹, Plimit = 5 s⁻¹. For which nmin is E(n=nmin) minimal? Can it make sense to use more than nmin cores?

(b) Now assume we apply a code optimization (such as SIMD vectorization) that improves the single-core performance to P(1) = 2.5 s⁻¹, but Plimit is unaffected. How does that change nmin and E(nmin)? What is the general conclusion from this result?

(c) Keeping P(1) as it is, we now perform an optimization that improves the saturated performance Plimit to 7.5 s⁻¹. How does that change nmin and E(nmin)? What is the general conclusion from this result?

(d) What qualitative impact do you expect from changing the clock speed (higher or lower)? Remember that the dynamic power dissipation of each core goes with the third power of the frequency. Assume that Plimit is unaffected by the clock speed.