## Assignment 5: OpenMP, Roofline

**(10 credits)***Handling recurrence.*Parallelize the second loop in the following code using OpenMP:#define N 10000000

int main() {

int i,b[N],c[N],d[N];

for(i=0;i<N;i++)

b[i]=c[i]=d[i]=rand();

// parallelize this loop

for(i=1;i<N;i++) {

b[i]=1+i;

c[i]=b[i-1]+i;

d[i]=c[i-1]+i;

}

}

If you are unsure about the correctness of your code after parallelization, run it.**(20 credits)***π by the Monte Carlo method*. The quarter circle in the ﬁrst quadrant with origin at (0,0) and radius 1 has an area of π/4. Look at the random number pairs in [0, 1] × [0, 1]. The probability that such a point lies inside the quarter circle is π/4, so given enough statistics we are able to calculate π using this so-called “Monte Carlo” method.

(a) Write a parallel OpenMP program that performs this task. Use the rand_r() function to get separate random number sequences for all threads. Make sure that adding more threads does not harm the statistics (how is this achieved?).

(b) What is the best relative accuracy that you can achieve with ten "Emmy" cores in one second of walltime? Consider and describe any code optimizations (beyond parallelization) that can speed up the computation!

(Hint: You may assume that the random number function rand_r() produces "perfect" random numbers with no correlations whatsoever)**(30 credits)***Parallel vector triad*. Parallelize the standard vector triad benchmark from earlier assignments by using OpenMP. To do this, compile and link with the -qopenmp switch and use the "fused" parallel for/do directives (#pragma omp parallel for).

Perform benchmark runs with the parallel triad on one Emmy socket. Draw performance graphs for N = 10^{1}...10^{7}(use log scaling on the x axis) for 1...10 threads (draw them all in the same diagram). Use likwid-pin to control the placement of threads. Example:$ module load

$ likwid-pin -c S0:0-9 ./vectortriad.exe

This runs the benchmark with 10 threads on the first socket (likwid-pin sets the OMP_NUM_THREADS variable automatically according to the pin mask if it is not set already).

(a) Comment on the scalability of the benchmark for different sizes (L1 cache, L2 cache, L3 cache, memory) and compare with the purely serial code (compiled without -qopenmp). For which working set sizes do you get a "good" speedup with multiple threads?

(b) For N=10^{7}run the benchmark with 1...10 threads on one socket and draw a scaling graph. From the data, calculate the maximum memory bandwidth that you can achieve.

Note: The usual precautions regarding pinning overhead apply. Do*not*benchmark the first parallel region that is entered during execution.**(40 credits)***Triangular parallel MVM*. Consider the multiplication of a dense lower triangular matrix with a vector:

Write a benchmark code for this kernel.**for(r=0; r<N; ++r)**

**for(c=0; c<=r; ++c)****y[r] += a[r][c] * x[c];**

(a) Parallelize the code with OpenMP and static loop scheduling. Which loop should be parallelized, and why? Run the code with 1,...,10 threads on one Emmy socket at problem sizes of 1600x1600 and 8000x8000, respectively. Draw a graph of performance in Mflop/s vs. cores (with the two data sets). Discuss the emerging patterns!

(b) Can you improve the performance by choosing an appropriate loop schedule? Present the data as in (a) but with the best loop schedule you could find.

(c) Use the Roofline model to calculate an upper performance limit of the code at problem size 8000x8000 on one Emmy socket. (Hint: The read-only memory bandwidth is about 47 Gbyte/s). Draw the limit into your diagrams above.