Assignment 6: Stencils, stencils

  1. Parallel 2D Jacobi. (30 credits) Implement the  2D Jacobi smoother shown in the lecture with OpenMP (omit the convergence check) and execute a fixed number of update sweeps (e.g., 200). Include the code of the main update sweep in your submission.

    (a) Choose a problem size of 4000x4000 (double precision) and report the performance in MLUP/s for 1..10 threads on one socket of Emmy. Explain the observed performance characteristic.
    (b) Assuming a code balance of 24 Bytes/LUP in memory, what bandwidth in GByte/s does your code draw at its maximum performance from the main memory of a socket?
    (c) What happens if you run the code with 20 cores on two sockets? Is this behavior expected? Fix it if you can.

    You can measure the memory bandwidth and other hardware metrics using the likwid-perfctr tool. In order to use it you have to submit the batch job using the "likwid" property:

    $ qsub -l nodes=1:ppn=40:f2.2:likwid,walltime=01:00:00 -I

    For starting your run you load the "likwid" module and run your application with likwid-perfctr as a wrapper:

    $ module load intel64 likwid/4.3.2
    $ likwid-perfctr -C S0:0-9 -f -g MEM ./a.out

    Used in this way, likwid-perfctr does a simple end-to-end measurement of the chosen performance group (MEM in this case, which contains memory-related metrics for the whole socket). Take care to perform enough time steps so that the startup phase becomes negligible! Pinning is done automatically if you use -C, so there is no need to apply likwid-pin on top.

    Other interesting performance groups are L3, L2, FLOPS_DP. Use "likwid-perfctr -a" to get a full list. Docs are available at

  2. Parallel Gauss-Seidel. (70 credits) We want to parallelize a 2D Gauss-Seidel solver using OpenMP and a pipeline parallel processing strategy (as outlined in the lecture). We stick to the original algorithm (i.e., no fishy red-black tricks etc.). A serial sample code can be found in ~unrz55/GettingStarted/gs_plain.c.

    (a) Parallelize the code along the same lines as shown in the lecture for the 3D case. Run your code with a problem size of 8000x8000 on 1..10 threads of an Emmy socket and report performance vs. thread count. What ccNUMA-aware data placement strategy should be implemented when running the code on two sockets?

    (b) Do "layer conditions" apply? What is the expected full-socket performance assuming bandwidth-bound execution? 

    (c) Estimate the impact of thread synchronization (barriers), assuming a 2000-cycle penalty for a full-socket OpenMP barrier (5000 cycles for the full node). Does it impact the performance significantly?

    (d) Using likwid-perfctr (see previous exercise), measure the memory bandwidth and data volume of your parallel Gauss-Seidel code on one Emmy socket when using 10 threads. Calculate the actual code balance from the measurement. Does it match your model?

    Hint: For accurate measurements you either have to use the LIKWID marker API (see for a C-based tutorial) or make sure that your program runs long enough to amortize any startup overhead.

    (e) Will SMT (Simultaneous Multi-Threading) help with the performance of your parallel solver? Find out how to pin your threads correctly and report performance with 1..20 threads on one Emmy socket. Does the result match your expectations?