Microbenchmarking: the vector triad

Implement the standard vector triad benchmark from the lecture and parallelize it with OpenMP. Compile and link with the -openmp switch and use the "fused" parallel for/do directives as follows:

Fortran:

    do r=1,NITER
    !$OMP PARALLEL DO
      do i=1,N
        a(i) = b(i) + c(i) * d(i)
      enddo
    !$OMP END PARALLEL DO
    enddo

C:

    for(r=0; r<NITER; ++r) {
    #pragma omp parallel for
      for(i=0; i<N; ++i)
        a[i] = b[i] + c[i] * d[i];
    }
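As a starting point, the C loop above could be wrapped in a timing harness along the following lines. This is only a minimal sketch, not the skeleton code from the SCAN/ folder: the helper names `wall_time`, `time_triad` and `triad_mflops`, the `min_seconds` parameter, the initial array values and the dummy branch after the inner loop are all assumptions made for illustration. The code also builds without OpenMP, in which case the pragma is ignored and the triad runs sequentially.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Time in seconds: OpenMP wall clock if available, otherwise CPU time
   from clock(), which is adequate for the purely sequential build. */
static double wall_time(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / (double)CLOCKS_PER_SEC;
#endif
}

/* Hypothetical helper: runs the triad niter times on arrays of length n.
   Returns the elapsed time in seconds, or -1.0 if allocation fails. */
static double time_triad(long n, long niter)
{
    double *a = malloc((size_t)n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    double *c = malloc((size_t)n * sizeof *c);
    double *d = malloc((size_t)n * sizeof *d);
    double elapsed = -1.0;

    if (a && b && c && d) {
        /* Arbitrary (assumed) initial values; all results stay positive. */
        for (long i = 0; i < n; ++i) {
            a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; d[i] = 3.0;
        }
        double start = wall_time();
        for (long r = 0; r < niter; ++r) {
#pragma omp parallel for
            for (long i = 0; i < n; ++i)
                a[i] = b[i] + c[i] * d[i];
            /* Never true here; intended to discourage the compiler
               from optimizing the repeat loop away. */
            if (a[n / 2] < 0.0)
                printf("unexpected value\n");
        }
        elapsed = wall_time() - start;
    }
    free(a); free(b); free(c); free(d);
    return elapsed;
}

/* Hypothetical helper: doubles niter until one run takes at least
   min_seconds, then converts to MFlop/s (2 flops per loop iteration). */
static double triad_mflops(long n, double min_seconds)
{
    long niter = 1;
    double t = time_triad(n, niter);
    while (t >= 0.0 && t < min_seconds) {
        niter *= 2;
        t = time_triad(n, niter);
    }
    if (t <= 0.0)
        return -1.0;   /* allocation failure or unusable timing */
    return 2.0 * (double)n * (double)niter / t / 1.0e6;
}
```

Calling, for example, `triad_mflops(N, 0.2)` from `main` for each N and printing the result gives one data point per size. With double-precision arrays the working set is 4 × 8 × N = 32·N bytes, which helps when relating a given N to the cache sizes.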


To set the number of threads, set the OMP_NUM_THREADS environment variable to the desired value before starting your executable, e.g.:

$ env OMP_NUM_THREADS=10 ./a.out
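To check that the setting actually took effect, the program can report its thread count at startup. The following is a small sketch; the helper names `max_threads` and `active_threads` are made up for illustration, while `omp_get_max_threads()` is the standard OpenMP runtime call. Without OpenMP support both helpers simply return 1.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: upper bound on the team size of the next
   parallel region, as controlled by OMP_NUM_THREADS. */
static int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Hypothetical helper: counts the threads that really take part in a
   parallel region by letting each one add 1 to a reduction variable. */
static int active_threads(void)
{
    int count = 0;
#pragma omp parallel reduction(+:count)
    count += 1;
    return count;
}
```

Printing both values once before the benchmark loop, e.g. `printf("%d %d\n", max_threads(), active_threads());`, makes it easy to spot a run that was started with the wrong environment.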

  1. Perform benchmark runs with the sequential triad code on one Emmy core. Make sure that the actual benchmark loop is repeated often enough (i.e., set NITER appropriately) to get proper measurements (you may consult the skeleton code in the SCAN/ folder). Draw a performance graph (or make a table) for N = 101...10 (use log scaling on the x axis) and check that the result is similar to what you expect.

  2. Now fix the clock speed to 2.2 GHz. Can you interpret the observable changes to the data?

  3. Repeat the experiment for 1...20 threads. Use likwid-pin to control the placement of threads. Example:

    $ module load likwid
    $ env OMP_NUM_THREADS=2 likwid-pin -c N:0-19 ./a.out

    This runs the benchmark with 2 threads on one socket. Comment on the scalability of the benchmark for different sizes (L1 cache, L2 cache, memory) and compare with the purely serial code.

  4. What happens if you open the parallel region outside the repeat loop?
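For the last task, one possible way to open the parallel region outside the repeat loop is sketched below. The function name `triad_hoisted` and its argument list are assumptions for illustration. The sketch only shows the code structure; what changes in the measurements is left to the exercise.

```c
/* Hypothetical variant of the triad: the thread team is created once,
   and only the work-sharing "omp for" is encountered in every repeat. */
static void triad_hoisted(double *a, const double *b, const double *c,
                          const double *d, long n, long niter)
{
#pragma omp parallel
    {
        /* r is declared inside the region, so each thread has its own
           copy and all threads execute the same number of repeats. */
        for (long r = 0; r < niter; ++r) {
#pragma omp for
            for (long i = 0; i < n; ++i)
                a[i] = b[i] + c[i] * d[i];
            /* The implicit barrier at the end of "omp for" keeps the
               threads in step between repeats. */
        }
    }
}
```

Compared with the fused `parallel for` version, the difference lies in how often a parallel region is entered and left: once in total here, once per repeat in the original.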
Last modified: Monday, 4 April 2016, 12:54 AM