Exercise: Matrix-free CG solver


Copy the source files to your home directory with:

$ cp -a ~j75n0000/MFCG ~

Get an interactive job on the Emmy cluster with:

$ qsub -I -l nodes=1:ppn=40:f2.2:likwid  -l walltime=08:00:00

Set up the environment:

$ module load intel64
$ module load likwid/4.3.4


Only a C version of the source code is available; it is in the C folder. Build the executable with:

$ icc -Ofast -xhost -std=c99 -o ./mfcg ./mfcg.c

Test if it is working:

$ likwid-pin -c S0:2 ./mfcg  2000 20000

The problem size is specified as two numbers: the outer and the inner dimension. Performance-wise, the split between the two matters only for the stencil update part of the algorithm.

Performance Engineering 

Time profile

Compile the code with the -pg switch to instrument with gprof:

$ icc -Ofast -xhost -std=c99 -pg -o ./mfcg ./mfcg.c

After running the code you end up with a gmon.out file, which can be converted to readable form by the gprof tool:

$ gprof ./mfcg

The result is a "flat profile" and a "call graph" (sometimes called a butterfly graph) of the application.

Performance profiling

Instrument the source code with the LIKWID marker API.

Build the new version with:

$ icc -Ofast -xhost -std=c99 -DLIKWID_PERFMON  -o ./mfcg $LIKWID_INC ./mfcg-marker.c  $LIKWID_LIB -llikwid

Test your new version using:

$ likwid-perfctr  -C S0:3 -g MEM_DP -m ./mfcg 2000 20000

What do we learn from profiling?

Analysis, Optimization and Validation

By looking at the code, can you reconcile the measured computational intensity of likwid-perfctr for each loop with a manual analysis? 

Can you think of an optimization that would improve the performance? First repeat the plain benchmark (built without the -DLIKWID_PERFMON define), then validate the result with profiling.

Going parallel

Parallelize both the initial and optimized version with OpenMP. Does your code behave in accordance with a Roofline model?

Last modified: Tuesday, 8 October 2019, 2:39 PM