## Assignment 9: Ping-pong, message aggregation, MPI ray tracer

#### Note: Using MPI on Emmy

If you load an Intel compiler module (e.g., intel64) on Emmy, an appropriate MPI module is loaded automatically.

$module load intel64$ module list
1) intelmpi/2017up04-intel  2) mkl/2017up05             3) intel64/17.0up05

To compile a program, use the Intel-provided compiler scripts for C (mpiicc), C++ (mpiicpc), or Fortran (mpiifort). These behave like the normal compilers but know how to find the include files and libraries for MPI. Example:

$mpiicc -Ofast -xHost hello.c -o hello.exe  MPI programs can only be run within a batch job and not on the frontends. Hence, if you want to try MPI interactively, you have to submit an interactive batch job. Starting an MPI program also requires an MPI wrapper script. We provide a specially adapted script called mpirun_rrze: $ mpirun_rrze -np 4 ./hello.exe    # run 4 MPI processes

Use the -h option to mpirun_rrze to explore its features. For process-core affinity you can use the -pin option. E.g., to run 20 processes on 2 nodes with 5 processes per socket pinned to the first 5 physical cores:

$mpirun_rrze -np 20 -pin 0_1_2_3_4_10_11_12_13_14 ./hello.exe  This means you have to supply a list of system core IDs (as given, e.g., by likwid-topology) for all processes running on a node. The list applies to all nodes involved in the job. 1. MPI Ping-Pong. (30 credits) Write a simple "Ping-Pong" benchmark (as shon in the lecture) test to measure the unidirectional effective bandwidth (bytes transferred per second) and latency (time for transmission of a zero-byte message) of MPI communication on Emmy between two processes. To do this, you have to allocate two full nodes via the batch system (using -l nodes=2:ppn=40), and then start one process on each node. This is done via the "-pernode" option to mpirun: $ mpirun_rrze -pernode ./a.out

The Ping-Pong sends a message from one process to the other and then back (using the standard MPI_Send() and MPI_Recv() functions). The time for the whole back-and-forth transfer is measured on the process that calls MPI_Send() first. Make sure the whole transfer is repeated many times so that you get accurate timings. Submit your code along with your answers.

(a) Measure unidirectional internode effective bandwidth (data volume divided by the transfer time) for message sizes between 1 and 108 bytes. Obtain the transfer latency (time for transferring a zero-byte message) and the asymptotic bandwidth. Be aware that measuring a zero-byte transfer directly may not work because MPI may detect that you don't actually send anything and return right away.
(b) Repeat your measurements for intranode MPI communication. By selecting the cores that your processes run on you can distinguish between intersocket and intrasocket situations:

$mpirun -np 2 -pin 0_1 ./a.out # intrasocket$ mpirun -np 2 -pin 0_10 ./a.out # intersocket


Of course it is sufficient here to allocate only one node. Draw the effective bandwidths for both cases into the diagram from (a) and describe the basic differences in latency and asymptotic bandwidth between intrasocket, intersocket, and internode situations. Is intranode MPI "infinitely fast"?

2. Message aggregation. (20 credits) Someone has written a program for a parallel computer. This computer consists of "compute elements" (e.g., "nodes") that are connected via a non-blocking fat-tree Gigabit Ethernet network (latency: 30 µs, asymptotic bandwidth: 125 MByte/s). You can assume a simple bandwidth-latency model for describing the GigE links. The program sends and receives many messages of size 1000 bytes between the nodes when it runs. About 30% of the runtime is spent in communication. Determine if and why it makes sense to restructure the program so that multiple messages are sent in a single "big package." What is the expected performance gain when n messages are aggregated into one?

3. MPI-parallel raytracer. (50 credits) Parallelize the raytracer from Assignment 9 with MPI. This can be done with simple MPI calls, using the calc_tile() function as the basic unit of work. Submit your code together with your answers.

(a) Use an overall problem size of 8000x8000 pixels and report performance in MPixels/s on 1...40 physical cores (up to 2 nodes) of Emmy (i.e., strong scaling). What is the optimal tile size here? Why can tiles not be as small as in the OpenMP code?
(b) Add code to calculate the average gray value across all pixels using an MPI collective operation. What is the average gray value?

Hint: The easiest approach to MPI parallelization is a "master-worker" scheme where one process (the master) sends out work and receives finished tiles from the workers. There may be other solutions, though.