Assignment 7: OpenMP, performance


  1. Triangular parallel MVM. Consider the multiplication of a dense lower triangular matrix with a vector:
    for(r=0; r<N; ++r)
      for(c=0; c<=r; ++c)
        y[r] += a[r][c] * x[c];
    Write a benchmark code for this operation (a minimal skeleton is sketched after the questions). Parallelize it with OpenMP and run it with 1 to 10 threads on one Emmy socket, scaling the problem size (N) from 50x50 to 8000x8000.

    (a) Which loop should be parallelized? What is the optimal OpenMP loop schedule? Why?
    (b) Plot performance vs. N for the best and the worst loop schedule.
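
    A possible benchmark skeleton is shown below. The initialization values, the repetition count NITER, and the use of schedule(runtime) (so that the schedule can be varied via the OMP_SCHEDULE environment variable without recompiling) are choices of this sketch, not part of the assignment; picking the schedule is your task in (a).

      #include <stdio.h>
      #include <stdlib.h>
      #include <omp.h>

      int main(void) {
          int N = 2000, NITER = 100;  /* example sizes; the assignment scans N up to 8000 */
          double *x = malloc(N * sizeof *x);
          double *y = calloc(N, sizeof *y);
          double (*a)[N] = malloc(sizeof(double[N][N]));
          for (int r = 0; r < N; ++r) {
              x[r] = 1.0;
              for (int c = 0; c < N; ++c) a[r][c] = 1.0;
          }
          double t0 = omp_get_wtime();
          for (int it = 0; it < NITER; ++it) {
              /* set OMP_SCHEDULE in the environment to experiment with schedules */
              #pragma omp parallel for schedule(runtime)
              for (int r = 0; r < N; ++r)
                  for (int c = 0; c <= r; ++c)
                      y[r] += a[r][c] * x[c];
          }
          double t = omp_get_wtime() - t0;
          /* N*(N+1)/2 inner iterations per sweep, 2 flops (MULT+ADD) each */
          printf("N=%d: %.1f MFlop/s\n", N, (double)N * (N + 1) * NITER / t / 1e6);
          free(x); free(y); free(a);
          return 0;
      }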

  2. Accelerators vs. multicore processors. The Nvidia K20 GPGPU and the Intel Xeon Phi coprocessor were accelerators that promised superior performance compared to standard CPUs.

    The Kepler GPGPU has an achievable memory bandwidth of about 160 GByte/s and a peak performance (double precision) of about 1.4 TFlop/s.

    The Xeon Phi "Knights Corner" has an achievable memory bandwidth of about 160 GByte/s and a (DP) peak performance of about 1 TFlop/s.

    We want to compare the performance of these two accelerators with a standard dual-socket Intel Xeon "Ivy Bridge" compute node with 10 cores per socket and a clock speed of 2.2 GHz (which costs about the same as one accelerator card and also draws comparable power). One Ivy Bridge socket has an achievable memory bandwidth of 40 GByte/s and each core can execute eight DP Flops per cycle (4 MULT, 4 ADD) when using AVX vectorization.

    Calculate the performance gain of the following codes when using either accelerator instead of the CPU node (a "long" loop over i is implied). Assume that either the peak performance or the main memory bandwidth is the only bottleneck.

    (a) a[i] = b[i] + c[i] (STREAM add)

    (b) Matrix-matrix multiply (runs at 90% of peak on Ivy Bridge and at 85% of peak on the accelerators)

    How does the picture change for (b) if you run the code on the CPU with a single thread and non-vectorized (i.e., not using SIMD)? How does the performance on the accelerators change if you have to copy the arrays b[] and c[] in (a) to the accelerator via the PCIe 2.0 bus (max. bandwidth of 6 GByte/s) before the loop, and then copy array a[] back afterwards?
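
    As a starting point (the actual comparison is left as the exercise), the CPU node's aggregate limits follow directly from the data above:

      Peak performance:  2 sockets x 10 cores x 2.2 GHz x 8 Flops/cycle = 352 GFlop/s
      Memory bandwidth:  2 sockets x 40 GByte/s = 80 GByte/s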


  3. Parallel raytracer. Parallelize a simple raytracer using OpenMP. You can find the source code (C and F90 versions) in ~unrz55/GettingStarted/raytrace{_c.c,_f.f90}.

    The picture is divided into so-called tiles (see figure). The function which is responsible for calculating a tile is called calc_tile(). Important parameters are size (the number of pixels in one direction) and tilesize (the number of pixels of one tile in one direction), both hard-coded at the start of main(). The commands that print the picture in PNM format are commented out, but you can use them to check for correctness. You can display the image with the ’display’ tool from the ImageMagick toolkit. Does your code scale up to 10 Emmy cores with a picture size of 5000×5000 pixels? Is there an optimal tile size? What is the best runtime (excluding output) you can get on a Emmy node? Also think of scalar optimizations that may be beneficial. Does SMT help?