## Roofline model for the vector triad on Emmy

Model the performance of the vector triad (you should have a working parallel OpenMP code for that) on an "Emmy" processor. The 10-core Ivy Bridge CPUs in Emmy have an attainable saturated memory bandwidth of about 42 GB/s and a base clock speed of 2.2 GHz. One core can execute one full-width (AVX) LOAD per cycle and one full-width AVX STORE every two cycles (i.e., the throughput of STORE instructions is 0.5 instructions per cycle). The core can also execute one AVX MULT and one AVX ADD instruction per cycle. There are no FMA instructions on this processor.
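As a starting point, the throughput figures above can be turned into a cycle count per AVX loop iteration. The sketch below is only one possible way to do this bookkeeping. It assumes double-precision data (4 elements per AVX register), perfect overlap of all execution units, and an instruction mix of 3 LOADs, 1 STORE, 1 MULT (or divide) and 1 ADD per AVX iteration. The names `THROUGHPUT`, `cycles_per_avx_iteration` and `p_max_gflops` are made up for illustration; check the instruction mix against your own compiled loop.

```python
# Instruction throughputs per core in instructions per cycle, taken from the text.
# The "div" entry is the AVX divide throughput given later in the assignment.
THROUGHPUT = {"load": 1.0, "store": 0.5, "mult": 1.0, "add": 1.0, "div": 1.0 / 28.0}


def cycles_per_avx_iteration(instr_counts, throughput=THROUGHPUT):
    """Cycles for one AVX iteration, assuming all units overlap perfectly.

    The slowest unit then determines the runtime of the iteration.
    """
    return max(count / throughput[name] for name, count in instr_counts.items())


def p_max_gflops(instr_counts, flops_per_scalar_iter=2, simd_width=4,
                 clock_ghz=2.2, cores=10):
    """Core-bound performance limit of the full chip in GFlop/s."""
    cycles = cycles_per_avx_iteration(instr_counts)
    flops_per_avx_iter = flops_per_scalar_iter * simd_width
    return cores * clock_ghz * flops_per_avx_iter / cycles


# Assumed instruction mix per AVX iteration (4 scalar iterations).
triad = {"load": 3, "store": 1, "mult": 1, "add": 1}
triad_div = {"load": 3, "store": 1, "div": 1, "add": 1}

if __name__ == "__main__":
    for name, mix in (("triad", triad), ("triad with divide", triad_div)):
        print(f"{name}: {cycles_per_avx_iteration(mix):.1f} cy per AVX iteration, "
              f"P_max = {p_max_gflops(mix):.2f} GFlop/s")
```

Taking the maximum over the units encodes the full-overlap assumption; a more pessimistic model would add up the cycle counts instead.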

- What is the full-chip $P_{max}$ for the vector triad **A(:)=B(:)+C(:)*D(:)**?
- What is the in-memory code balance?
- What is the expected performance for the vector triad on the full chip at a problem size of $N=10^{7}$? Assume there is no overhead from parallelization, which is reasonable at this problem size. Can you validate your model?
- What would happen if you replaced the multiplication with a divide: **A(:)=B(:)+C(:)/D(:)**? Knowing that the AVX divide instruction in this CPU has a throughput of 1/28 instructions per cycle, repeat the analysis. Is the code memory bound? Validate your model by measurements!
- How many divides would you need (with everything else unchanged) to decouple from main memory?
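The questions above all reduce to comparing a core-bound limit with a memory-bound limit. The following is a minimal sketch of that comparison, not a reference solution. It assumes double-precision data, a write-allocate transfer for the store to A (so five 8-byte words move per scalar iteration), and divides that each occupy the divider for 28 cycles with all other instructions hidden behind them. The function names are hypothetical.

```python
import math


def triad_bytes_per_iteration(write_allocate=True, bytes_per_word=8):
    """Memory traffic per scalar iteration: B, C, D loaded, A stored.

    With write-allocate (assumed here), A is also read once before the store.
    """
    words = 4 + (1 if write_allocate else 0)
    return words * bytes_per_word


def triad_code_balance(write_allocate=True, flops=2):
    """In-memory code balance in bytes per flop."""
    return triad_bytes_per_iteration(write_allocate) / flops


def roofline_gflops(p_max_gflops, bandwidth_gbs, code_balance):
    """Roofline prediction: the lower of the core and memory limits."""
    return min(p_max_gflops, bandwidth_gbs / code_balance)


def min_divides_to_decouple(bandwidth_gbs=42.0, clock_ghz=2.2, cores=10,
                            simd_width=4, div_cycles=28.0, write_allocate=True):
    """Smallest number of AVX divides per AVX iteration for which the
    in-core time is at least the memory transfer time."""
    bytes_per_avx_iter = simd_width * triad_bytes_per_iteration(write_allocate)
    # Bytes / (GB/s) gives ns; all cores share the bandwidth, so each core
    # sees `cores` times the chip-level transfer time per AVX iteration.
    memory_cycles = cores * bytes_per_avx_iter / bandwidth_gbs * clock_ghz
    return math.ceil(memory_cycles / div_cycles)


if __name__ == "__main__":
    # Arithmetic peak (one AVX MULT + one AVX ADD per cycle, 4 doubles each);
    # this is only an upper bound on the triad's P_max.
    arithmetic_peak = 10 * 2.2 * 8
    balance = triad_code_balance()
    print("code balance [B/F]:", balance)
    print("roofline [GFlop/s]:", roofline_gflops(arithmetic_peak, 42.0, balance))
    print("divides to decouple:", min_divides_to_decouple())
```

Replace `arithmetic_peak` with your own value for $P_{max}$ when repeating the analysis for the divide variant. The prediction changes if your code uses non-temporal stores (set `write_allocate=False`), so compare both variants against your measurements.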

Last modified: Tuesday, 14 March 2017, 8:46 AM