Exercise: Matrix-free CG solver
Preparation
Copy the source files to your home directory via
$ cp -a ~j75n0000/MFCG ~
Get an interactive job on the Emmy cluster with:
$ qsub -I -l nodes=1:ppn=40:f2.2:likwid -l walltime=08:00:00
Set up the environment:
$ module load intel64
$ module load likwid/4.3.4
Only a C version of the source code is available, in the C folder. Build the executable with:
$ icc -Ofast -xhost -std=c99 -o ./mfcg ./mfcg.c
Test if it is working:
$ likwid-pin -c S0:2 ./mfcg 2000 20000
The problem size is specified as two numbers: the outer and the inner dimension. Performance-wise, this split matters only for the stencil update part of the algorithm.
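For orientation, a matrix-free "matrix-vector product" for a 2D 5-point stencil on an Nout x Nin grid could look like the sketch below. The names, coefficients, and loop bounds are illustrative only; the actual kernel in mfcg.c may differ in detail.

/* Sketch of a matrix-free 5-point stencil update on an Nout x Nin grid,
 * stored row-major in 1D arrays (illustrative only, not the mfcg.c kernel). */
void stencil_mvm(double *y, const double *x, int Nout, int Nin)
{
    for (int i = 1; i < Nout - 1; ++i) {      /* outer dimension */
        for (int j = 1; j < Nin - 1; ++j) {   /* inner, stride-1 dimension */
            y[i * Nin + j] = 4.0 * x[i * Nin + j]
                           - x[i * Nin + j - 1] - x[i * Nin + j + 1]
                           - x[(i - 1) * Nin + j] - x[(i + 1) * Nin + j];
        }
    }
}

The inner dimension sets the length of the stride-1 streams and determines whether the neighboring rows needed for each update stay in cache, which is why only this part of the algorithm is sensitive to how the problem size is split.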
Performance Engineering
Time profile
Compile the code with the -pg switch to instrument with gprof:
$ icc -Ofast -xhost -std=c99 -pg -o ./mfcg ./mfcg.c
After running the code you end up with a gmon.out file, which can be converted to readable form with the gprof tool:
$ gprof ./mfcg
The result is a "flat profile" and a "butterfly graph" of the application.
Performance profiling
Instrument the source code with the LIKWID marker API.
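If you have not used the marker API before: with LIKWID 4.x the macros come from likwid.h and are only active when the code is compiled with -DLIKWID_PERFMON. A minimal sketch (the region names are placeholders; put each loop you want to analyze into its own region):

#include <likwid.h>   /* LIKWID_MARKER_* macros; no-ops without -DLIKWID_PERFMON */

int main(void)
{
    LIKWID_MARKER_INIT;              /* once, at program start */

    LIKWID_MARKER_START("stencil");  /* wrap the loop to be measured */
    /* ... stencil update loop ... */
    LIKWID_MARKER_STOP("stencil");

    LIKWID_MARKER_START("dot");
    /* ... dot product loop ... */
    LIKWID_MARKER_STOP("dot");

    LIKWID_MARKER_CLOSE;             /* once, before exit */
    return 0;
}

For the threaded version later in the exercise, the LIKWID documentation also recommends calling LIKWID_MARKER_THREADINIT from each thread inside a parallel region.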
Build the new version with:
$ icc -Ofast -xhost -std=c99 -DLIKWID_PERFMON -o ./mfcg $LIKWID_INC ./mfcg-marker.c $LIKWID_LIB -llikwid
Test your new version using:
$ likwid-perfctr -C S0:3 -g MEM_DP -m ./mfcg 2000 20000
What do we learn from profiling?
Analysis, Optimization and Validation
By looking at the code, can you reconcile the computational intensity measured by likwid-perfctr for each loop with a manual analysis?
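As an example of such a manual estimate (a generic streaming loop, not necessarily one from mfcg.c; double precision, data sets too large for cache, write-allocate assumed):

/* STREAM-triad-like loop, used here only to illustrate the bookkeeping.
 * Flops per iteration: 2 (one multiply, one add).
 * Minimum traffic per iteration: read b[i] (8 B) + read c[i] (8 B)
 *   + write-allocate a[i] (8 B) + store a[i] (8 B) = 32 B.
 * Expected computational intensity: 2 flop / 32 B = 0.0625 flop/B. */
void triad(double *a, const double *b, const double *c, double s, long n)
{
    for (long i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}

Compare such an estimate for each loop (region) with the intensity reported by likwid-perfctr.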
Can you think of an optimization that would improve the performance? Benchmark the optimized version without instrumentation (i.e., compiled without -DLIKWID_PERFMON), then validate the results with profiling.
Going parallel
Parallelize both the initial and optimized version with OpenMP. Does your code behave in accordance with a Roofline model?
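A minimal sketch of the OpenMP work-sharing pattern for two typical CG building blocks (illustrative; the actual loops in mfcg.c will differ in detail). Remember to compile with icc's -qopenmp switch and to pin the threads, e.g. with likwid-pin or the -C option of likwid-perfctr.

/* Dot product: the accumulator needs a reduction clause to avoid a race. */
double dot(const double *x, const double *y, long n)
{
    double s = 0.0;
#pragma omp parallel for reduction(+ : s)
    for (long i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* Stencil update: parallelize the outer loop, keep the inner loop stride-1. */
void stencil_mvm_omp(double *y, const double *x, int Nout, int Nin)
{
#pragma omp parallel for
    for (int i = 1; i < Nout - 1; ++i)
        for (int j = 1; j < Nin - 1; ++j)
            y[i * Nin + j] = 4.0 * x[i * Nin + j]
                           - x[i * Nin + j - 1] - x[i * Nin + j + 1]
                           - x[(i - 1) * Nin + j] - x[(i + 1) * Nin + j];
}

To check against the Roofline model you also need the attainable memory bandwidth of a socket, which can be measured for example with likwid-bench.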