Assignment 10: Parallelization galore

  1. Parallel histogram computation. The following code fragment computes a 16-bin histogram of the least significant four bits of the numbers produced by the standard rand_r() random number generator:


    unsigned int seed = 123;
    long hist[16];
    for(int i=0; i<16; ++i)
      hist[i] = 0;

    timing(&wcstart, &ct);
    for(long i=0; i<1000000000; ++i) {
      hist[rand_r(&seed) & 0xf]++;
    }
    timing(&wcend, &ct);

    for(int i=0; i<16; ++i) {
      cout << "hist[" << i << "]=" << hist[i] << endl;
    }
    cout << "Time: " << wcend-wcstart << " sec" << endl;


    Parallelize the histogram calculation using OpenMP. You can find serial (Fortran and C) example codes in the ~unrz55/GettingStarted/HISTO folder. Compare the runtimes of the original (serial) code, the 1-thread parallel version, and the 20-thread parallel version on one Emmy node. Does your solution scale? If it does not, make it scale!

    Hint: Try different solutions. Think simple first.
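
    One scalable variant (a sketch, not the only possible solution) privatizes both the histogram and the RNG seed. A plain "#pragma omp atomic" on the increment is correct but does not scale, and a shared rand_r() seed is a data race. Note that with per-thread seeds every thread consumes its own random stream, so the bin counts will differ from the serial run. The sketch assumes OpenMP 4.5 or newer for the array reduction; timing calls are omitted:

    #include <stdlib.h>
    #include <iostream>
    #include <omp.h>

    int main() {
      long hist[16] = {0};

      #pragma omp parallel
      {
        // private seed: sharing the rand_r() seed across threads would be
        // a data race and a false-sharing hotspot
        unsigned int seed = 123 + omp_get_thread_num();

        // the array reduction (OpenMP >= 4.5) gives every thread a private,
        // zero-initialized copy of hist[] and sums the copies at the end
        #pragma omp for reduction(+:hist)
        for(long i=0; i<1000000000; ++i)
          hist[rand_r(&seed) & 0xf]++;
      }

      for(int i=0; i<16; ++i)
        std::cout << "hist[" << i << "]=" << hist[i] << std::endl;
      return 0;
    }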

  2. Parallel Gauss-Seidel. We want to parallelize a 2D Gauss-Seidel solver using OpenMP and a pipeline-parallel processing strategy (as outlined in the lecture). We stick to the original algorithm (i.e., no fishy red-black tricks etc.). A serial sample code can be found in ~unrz55/GettingStarted/gs_plain.c.


    (a) Parallelize the code along the same lines as shown in the lecture for 3D. Run your code with a problem size of 8000x8000 on 1..10 threads of an Emmy socket and report performance vs. thread count. What ccNUMA-aware data placement strategy should be implemented when running the code on two sockets?
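
    One possible skeleton for the pipelined sweep, transcribing the lecture's 3D scheme to 2D, is sketched below. Thread t owns a contiguous block of inner columns and may update row i only after thread t-1 has finished row i, which preserves the serial Gauss-Seidel update order. The point-to-point progress counters, the padding factor, and the 64-thread limit are illustrative choices; global barriers once per row (or per block of rows) would realize the same pipeline:

    #include <omp.h>

    #define N   8000
    #define PAD 16   /* keep the progress counters on separate cache lines */

    /* one pipelined Gauss-Seidel sweep over phi (N x N, row-major) */
    void gs_sweep(double *phi) {
      static int rowdone[64*PAD];   /* last finished row per thread (<= 64 threads) */

      #pragma omp parallel
      {
        int tid = omp_get_thread_num();
        int nt  = omp_get_num_threads();
        /* this thread's contiguous block of inner columns */
        int js = 1 + (int)((long)(N-2)*tid/nt);
        int je = 1 + (int)((long)(N-2)*(tid+1)/nt);

        for(int i=1; i<N-1; ++i) {
          if(tid > 0) {     /* pipeline: spin until the left neighbor thread */
            int done;       /* has completed this row                        */
            do {
              #pragma omp atomic read
              done = rowdone[(tid-1)*PAD];
            } while(done < i);
          }
          for(int j=js; j<je; ++j)
            phi[i*N+j] = 0.25 * ( phi[(i-1)*N+j] + phi[(i+1)*N+j]
                                + phi[i*N+j-1]   + phi[i*N+j+1] );
          #pragma omp atomic write
          rowdone[tid*PAD] = i;     /* publish progress to the right neighbor */
        }

        #pragma omp barrier
        rowdone[tid*PAD] = 0;       /* reset the counters for the next sweep */
      }
    }

    Synchronizing on blocks of rows instead of single rows reduces the synchronization frequency at the cost of longer pipeline fill and drain phases; this trade-off reappears in part (c).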

    (b) Do "layer conditions" apply? What is the expected full-socket performance assuming bandwidth-bound execution? 
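
    For orientation (using an assumed, not measured, bandwidth number): the relevant layer condition here is roughly that three rows of a thread's column block (3 x 8 B x 8000/n_t per thread) fit into its cache share. If it holds, an in-place 5-point sweep causes one 8 B load and one 8 B write-back per lattice site update, i.e., a code balance of

      B_C = 8 B + 8 B = 16 B/LUP,

    so with a placeholder full-socket memory bandwidth of b_S = 40 GB/s the bandwidth-bound estimate would be

      P = b_S / B_C = (40 GB/s) / (16 B/LUP) = 2.5 GLUP/s.

    Substitute your own measured b_S for the Emmy socket.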

    (c) Estimate the impact of thread synchronization (barriers), assuming a 2000-cycle penalty for a full-socket OpenMP barrier (5000 cycles for the full node).
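
    Purely to illustrate the arithmetic (assuming one barrier per row and a 2.2 GHz clock):

      t_sync = 8000 barriers x 2000 cy/barrier / 2.2 GHz ≈ 7.3 ms per sweep,

    which can be compared with the bandwidth-bound sweep time from (b), e.g., 8000^2 LUP / 2.5 GLUP/s ≈ 25.6 ms. Adjust the barrier count to the synchronization granularity of your implementation.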


    (d) Using likwid-perfctr (see previous exercise), measure the memory bandwidth and data volume of your parallel Gauss-Seidel code on one Emmy socket when using 10 threads. Calculate the actual code balance from the measurement. Does it match your model?

    Hint: For accurate measurements you either have to use the LIKWID marker API (see https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerC for a C-based tutorial) or make sure that your program runs long enough to amortize any startup overhead.
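
    A minimal marker skeleton might look as follows (LIKWID 5 ships the macros in likwid-marker.h; older versions use likwid.h instead — check your installation; the region name "GS" is arbitrary). Compile with -DLIKWID_PERFMON and link with -llikwid, and call the START/STOP macros from within the parallel region so that all threads are measured:

    #include <likwid-marker.h>

    int main() {
      LIKWID_MARKER_INIT;
      #pragma omp parallel
      {
        LIKWID_MARKER_START("GS");
        /* ... Gauss-Seidel sweeps ... */
        LIKWID_MARKER_STOP("GS");
      }
      LIKWID_MARKER_CLOSE;
      return 0;
    }

    Then run it under likwid-perfctr, e.g. (the binary name is just an example):

      likwid-perfctr -C S0:0-9 -g MEM -m ./gs_pipeline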

    (e) Will SMT (Simultaneous Multi-Threading) help with the performance of your parallel solver? Find out how to pin your threads correctly and report performance with 1..20 threads on one Emmy socket. Does the result match your expectations?
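
    For example, with likwid-pin (in LIKWID's default intra-domain numbering the physical cores of a socket are listed first, so S0:0-9 should be the ten cores and S0:10-19 their SMT siblings — verify this with likwid-topology on Emmy; the binary name is again just an example):

      likwid-pin -c S0:0-9  ./gs_pipeline    # 10 threads: physical cores only
      likwid-pin -c S0:0-19 ./gs_pipeline    # 20 threads: both SMT threads per core

    For intermediate thread counts with paired SMT siblings you may need an explicit processor list instead of a range.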