## Assignment 4: Code balance, optimization, power dissipation

*Balance*. The in-memory code balance of a loop is the ratio of (number of bytes transferred to and from memory) to (amount of work [flops] done). It is a pivotal target metric for performance optimization. Calculate the in-memory code balance for loops with the following loop bodies, assuming that all arrays do not fit into any cache (a loop over i is always implied):

(a) s = s + a[i]; (double precision)

(b) s = s + a[i]*b[i]; (double precision)

(c) a[i] = b[i] + s*c[i]; (single precision)

(d) a[i] = a[i] + s*c[i]; (single precision)

(e) y[i] = y[i] + x[i] * v[index[i]]; (DP for x[],y[], and v[], and int32 for index[])

In case (e) you have to be a little imaginative; obviously it is impossible to give a single, definite answer...

(f) What is the code balance of the dense MVM code (assignment 3, task 1) in relation to the matrix size? You can assume that the matrix does not fit into any cache.*Optimizing for code balance*. Suppose you have a code like this:

for(i=0; i<N; ++i) { a[i] = b[i] * c[i]; } for(i=1; i<N-1; ++i) { b[i] = (a[i-1] + a[i+1])*0.5; }

All arrays hold DP floating-point numbers, and N is "large".

(a) Can you suggest an optimization that will improve the performance of the code? What is the expected change in the memory code balance?

(b) Construct a single-core performance model (along the lines of what we did in the lecture with the vector triad) for the optimized code on an Intel Haswell core.*Multicore power envelope*. We wish to understand why processor vendors trade clock frequency for concurrency when introducing multicore chips, so we establish a simple model for power dissipation. Assume that a certain chip (with a given manufacturing process, e.g., 28 nm) dissipates a power of W=W_{d}when running at its base frequency f_{0}. Furthermore assume that W is proportional to f^{3}.

(a) "Overclocking" is a popular strategy (especially among gamers) to speed up the CPU. What is the power dissipation W when the clock frequency is increased by 30% (e.g., from 2.0 GHz to 2.6 GHz)?

(b) If we require that the overall power dissipation of a chip shall be constant (W_{d}), we can trade clock speed for cores: By reducing the clock speed by some relative amount Δf/f_{0}, we can put more cores (transistors) on the same chip and still stay within the power limit. What is the required clock speed reduction Δf/f_{0}for an m-core chip with a power dissipation of W_{d}? How many cores can be put on the chip when the clock speed is reduced by a factor of two?

(c) Now assume m=8. What is the minimum and maximum performance gain (compared to the single-core chip with m times fewer transistors but the same power envelope, and the same memory interface) when using all 8 cores? For which types of code do you expect these extremal performance gains?