Assignment 9: Stencils

1. Parallel 2D Jacobi. Implement the 2D Jacobi smoother shown in the lecture with OpenMP (omit the convergence check) and execute a fixed number of update sweeps (e.g., 100).

(a) Choose a problem size of 4000x4000 (double precision) and report the performance in MLUP/s for 1..10 threads on one socket of Emmy. Explain the observed performance characteristic.

(b) Assuming a code balance of 24 Bytes/LUP in memory, what bandwidth in GByte/s does your code draw at its maximum performance from the main memory of a socket?

(c) What happens if you run the code with 20 cores on two sockets? Is this behavior expected?

You can measure the memory bandwidth and other hardware metrics with the likwid-perfctr tool. To use it, submit the batch job with the "likwid" property:

$ qsub -l nodes=1:ppn=40:f2.2:likwid,walltime=01:00:00 -I

For starting your run you load the "likwid" module and run your application with likwid-perfctr as a wrapper:

$ module load intel64 likwid/4.3.2
$ likwid-perfctr -C S0:0-9 -f -g MEM ./a.out

Used in this way, likwid-perfctr does a simple end-to-end measurement of the chosen performance group (MEM in this case, which contains memory-related metrics for the whole socket). Take care to perform enough time steps so that the startup phase becomes negligible! Pinning is done automatically if you use -C, so there is no need to apply likwid-pin on top.

Other interesting performance groups are L3, L2, FLOPS_DP. Use "likwid-perfctr -a" to get a full list. Docs are available at http://tiny.cc/LIKWID.

2. Performance modeling of a 3D Jacobi smoother. Consider the following OpenMP-parallel 3D Jacobi code:

#pragma omp parallel private(iter)
{
  for(iter=0; iter<maxiter; ++iter) {
    #pragma omp for schedule(runtime)
    for(k=1; k<N-1; ++k) {
      for(j=1; j<N-1; ++j) {
        for(i=1; i<N-1; ++i) {
          f[t1][k][j][i] = 1./6. *
                      (f[t0][k-1][j][i]+
                       f[t0][k+1][j][i]+
                       f[t0][k][j-1][i]+
                       f[t0][k][j+1][i]+
                       f[t0][k][j][i-1]+
                       f[t0][k][j][i+1]);
        }
      }
    }
    swap(t0,t1);
  }
}

Assume that N=350, all arrays are in double precision, and that we run this code on a ten-core Intel Ivy Bridge processor (like the one in Emmy: 25 MB of shared L3 cache, 256 kB of per-core L2, 32 kB of per-core L1, and a memory bandwidth of 41 GB/s).

(a) Is the code correct? If it is not, fix it.

(b) We set OMP_NUM_THREADS=10 and OMP_SCHEDULE="static" and run the code on one socket. What is the relevant bottleneck? What is the maximum performance we can expect? How about N=200?

(c) Someone says: "It is better to choose the data layout f[k][j][i][t], because the source and the target arrays are handled together in the same loop, and it will be better to interleave them." What is your comment on this?

(d) What happens if we set OMP_SCHEDULE="static,1" at N=350? (Assume the original data layout.)

(e) Instead of parallelizing the outer (k) loop, we might just as well parallelize the inner (i) loop. Assume that the cost for the implicit barrier at the end of a parallel for loop is 2000 CPU cycles. At N=350, what is the expected performance when the inner loop is parallelized (assuming OMP_SCHEDULE="static")? At which N would the performance exceed the level predicted in (b)? Can we do something about the barrier overhead when the inner loop is parallelized? (Hint: There is a way to switch off the automatic barrier at the end of work-sharing loops in OpenMP. Think carefully about how to leverage this feature here and still have a correct code).