Assignment 7: Pretty pictures, barriers, random numbers, and speedup

  1. Parallel ray tracer. (40 credits) Parallelize a simple ray tracer using OpenMP. You can find the source code (C and F90) in ~unrz55/GettingStarted/raytrace{_c.c,_f.f90}.

    The picture is divided into so-called tiles (see figure). The function responsible for calculating one tile is called calc_tile(). Important parameters are size (the number of pixels in one direction) and tilesize (the number of pixels of one tile in one direction), both hard-coded at the start of main(). The statements that print the picture in PNM format are commented out, but you can use them to check for correctness. You can display the image with the 'display' tool from the ImageMagick toolkit.

    (a) Draw a diagram of performance vs. number of cores on up to 20 physical cores of an Emmy node, using a picture size of 8000x8000 pixels.  
    (b) Is the memory bandwidth a bottleneck? How can you tell?  Hint: each pixel is one byte (8 bits), and the memory bandwidth of an Emmy socket is about 40 GByte/s.
    (c) Does your code scale up to 20 cores? If it does not, improve the scaling!  (Hint: Depending on how you did the parallelization, you may want to look up the collapse clause in OpenMP. Also remember that not all pixels are equal; a possible starting point is sketched below.) Is there an optimal tile size? Can you optimize the sequential code? What is the best performance in pixels per second you can get on an Emmy node? Document all your settings so that your experiment can be reproduced.
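
    A minimal sketch of one possible parallelization is shown below. It assumes a 2D loop nest over tiles and a function calc_tile(tx, ty) that renders one tile independently of all others; the actual loop structure and argument list in the handed-out source may differ. The dynamic schedule addresses the fact that tiles covering objects take longer than tiles of empty background:

    // Sketch only: the loop structure and the calc_tile() signature are assumptions.
    const int ntiles = size / tilesize;   // tiles per direction (assuming size is a multiple of tilesize)
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for(int ty=0; ty<ntiles; ++ty)
      for(int tx=0; tx<ntiles; ++tx)
        calc_tile(tx, ty);                // tiles are independent -> no data races

    Regarding (b), a useful sanity check: the whole 8000x8000 picture occupies only 8000*8000*1 byte = 64 MByte, so writing it to memory once at 40 GByte/s would take only about 1.6 ms; compare that with your measured run times.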

  2. Barrier pain. (30 credits) Write a benchmark that measures the time it takes to execute an OpenMP barrier. A possible starting point is sketched after the sub-questions below.

    (a) Plot the barrier time in cycles on 1..20 cores (compact pinning) on one Emmy node using the default Intel compiler and the default GNU compiler.
    (b) On an Intel Xeon Phi coprocessor (1.05 GHz, 60 cores, 8-way SIMD at double precision, 16 Flops per core per cycle) the barrier takes 18000 cycles (Intel compiler). How can we usefully compare this number with the one from (a)? Is 18000 cycles "slow" or "fast" compared to the Emmy barrier? Hint: Think about how much "work" (i.e., Flops) can be done per cycle on either system, and how "precious" a cycle thus is. Hence, if the same problem is executed on the two systems and you assume that the code (excluding the barriers) can reach peak performance, you can calculate the "pain" incurred by the barrier, i.e., how much performance loss it causes for a fixed amount of work.
    (c) What is the expected performance of the Schönauer vector triad at a loop length of N=500 with 10 threads (cores) on an Emmy socket, assuming that the executed loop code is "perfect" and the barrier is the only overhead?
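
    The following sketch shows one way such a benchmark could look: it executes many barriers back to back and reports the average. It is only a sketch; you still have to convert the time to cycles using the clock frequency, ensure proper pinning (e.g., via likwid-pin or OMP_PROC_BIND=close), and check that timer resolution and loop overhead do not distort the result.

    #include <cstdio>
    #include <omp.h>

    int main() {
      const int NITER = 1000000;     // many barriers to beat timer granularity
      double wct = 0.0;
      int nthreads = 1;
    #pragma omp parallel
      {
    #pragma omp single
        {
          nthreads = omp_get_num_threads();
          wct = omp_get_wtime();     // start the clock once all threads are up
        }
        for(int i=0; i<NITER; ++i) {
    #pragma omp barrier
        }
    #pragma omp single
        wct = omp_get_wtime() - wct;
      }
      printf("avg. barrier: %.1f ns with %d threads\n", wct/NITER*1e9, nthreads);
      return 0;
    }

    For (b), the Phi side of the comparison follows directly from the numbers given above: 60 cores * 16 Flops per core and cycle = 960 Flops per cycle of peak work, so a single 18000-cycle barrier costs roughly 1.7e7 Flops of potential work. Work out the corresponding number for Emmy and compare.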


  3. Parallel histogram computation. (20 credits) The following code fragment calculates a 16-bin histogram of the least significant four bits of the numbers produced by the standard rand_r() random number generator:


    unsigned int seed = 123;
    long hist[16];
    for(int i=0; i<16; ++i)
      hist[i] = 0;

    timing(&wcstart, &ct);
    for(long i=0; i<1000000000; ++i) {
      hist[rand_r(&seed) & 0xf]++;
    }
    timing(&wcend, &ct);

    for(int i=0; i<16; ++i) {
      cout << "hist[" << i << "]=" << hist[i] << endl;
    }
    cout << "Time: " << wcend-wcstart << " sec" << endl;


    Parallelize the histogram calculation using OpenMP. You can find serial (Fortran and C) example codes in the ~unrz55/GettingStarted/HISTO folder. Draw performance vs. number of cores in a diagram. Does your solution scale? If it does not, make it scale!

    Hint: Try different solutions. Think simple first. One possible approach is sketched below.
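
    For illustration, here is a minimal sketch of one such solution (not necessarily the best one): each thread fills a private histogram with a private seed, and the partial results are summed at the end. Note that with per-thread seeds the random number stream, and hence the exact bin counts, will differ from the serial run; omp_get_wtime() is used instead of the timing() helper to keep the sketch self-contained.

    #include <cstdlib>
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main() {
      long hist[16] = {0};
      double wcstart = omp_get_wtime();
    #pragma omp parallel
      {
        unsigned int seed = 123 + omp_get_thread_num();  // private seed per thread
        long myhist[16] = {0};                           // private histogram
    #pragma omp for
        for(long i=0; i<1000000000; ++i)
          myhist[rand_r(&seed) & 0xf]++;
    #pragma omp critical
        for(int i=0; i<16; ++i)                          // merge partial results
          hist[i] += myhist[i];
      }
      double wcend = omp_get_wtime();
      for(int i=0; i<16; ++i)
        cout << "hist[" << i << "]=" << hist[i] << endl;
      cout << "Time: " << wcend-wcstart << " sec" << endl;
      return 0;
    }

    With compilers supporting OpenMP 4.5 or newer, an array reduction (reduction(+:hist[:16]) on the parallel loop) is an alternative that removes the explicit merge.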



  4. Speedup games. (10 credits) A speaker at a conference presents performance data for her highly scalable code on up to 1000 cores of a recent multicore-based supercomputer platform (see the graph on the right). She says "We achieve a parallel efficiency of about 0.5 on 1000 cores, so there must be hidden optimization potential that we will investigate."

    What is the natural question to ask?