Hands-On: A matrix-free CG solver
In this hands-on we analyze a conjugate-gradient (CG) linear solver derived from the popular HPCG benchmark. It is "matrix free" because it does not store the coefficient matrix explicitly; the matrix is hard-coded, so the sparse matrix-vector multiplication (MVM) kernel is effectively a Jacobi-like stencil update scheme.
Copy the source files to your home directory via
$ cp -a ~ghager/MFCG ~
Build and run
There are C and F90 source codes available in the C/ and F90/ folders. Build the executable from C with:
$ icc -Ofast -xHost -std=c99 -o ./mfcg ./mfcg.c
For Fortran you need two steps:
$ icc -c timingC.c
$ ifort -Ofast -xHost -o ./mfcg ./mfcg.f90 timingC.o
Test if it is working:
$ likwid-pin -c S0:2 ./mfcg 8000 20000
The problem size is specified as two numbers: the outer (first argument)
and the inner dimension (second argument) of the grid. Performance-wise,
this is important only for the stencil update part of the algorithm.
The code implements a standard conjugate-gradient (CG) algorithm without
a stored matrix, i.e., the matrix-vector multiplication is a stencil
update sweep. Performance is printed in millions of lattice site updates
per second. This number pertains to the whole algorithm, not just the
stencil update.
Compile the code with the -pg switch to instrument it for gprof (this works with Fortran, too):
$ icc -Ofast -xHost -std=c99 -pg -o ./mfcg ./mfcg.c
After running the code you end up with a gmon.out file, which can be converted to readable form with the gprof tool:
$ gprof ./mfcg
The result is a "flat profile" and a "butterfly graph" of the application. Look at the flat profile and think about what went wrong here. Fix it and generate the profile again.
What is the "hot spot" of the program?
END of part 1
The code already has OpenMP directives. Add the "-qopenmp" option to the compiler command line to activate OpenMP. Run the code on the 24 physical cores of one VSC-4 socket. What is the performance on the full socket as compared to the serial version? Do you have a hypothesis about what the bottleneck might be?
Do an end-to-end performance counter measurement of the memory bandwidth of the program:
$ likwid-perfctr -C S0:0-23 -g MEM ./mfcg 8000 20000
Given what you learned about the memory bandwidth of a socket in the
hands-on about the BWBENCH code, is your hypothesis confirmed?
Make a Roofline model of the hot spot loop you have identified above.
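As a reminder, the naive Roofline model predicts the performance of a loop as

```latex
P = \min\left(P_\mathrm{peak},\; I \cdot b_S\right)
```

where $P_\mathrm{peak}$ is the applicable peak performance, $I$ is the computational intensity of the loop (flops per byte of data traffic), and $b_S$ is the saturated memory bandwidth of the socket, e.g., as measured with BWBENCH. For the stencil loop you need to count the flops per lattice site update and estimate the data traffic per update, including the write-allocate transfer on the store if applicable.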
Instrument the hot spot loop in the source code with the LIKWID marker API (be
guided by the MiniMD or the BWBENCH code, which already contain markers). Note that
the header file is called "likwid-marker.h" since LIKWID version 5. For
threaded code, LIKWID_MARKER_START() must be called from every thread of
the parallel region.
(Fortran programmers please look at https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerF90.)
Build the new version with (you can omit the -DLIKWID_PERFMON for Fortran):
$ icc -Ofast -xHost -std=c99 -DLIKWID_PERFMON -o ./mfcg $LIKWID_INC ./mfcg-marker.c $LIKWID_LIB -llikwid
Test your new version using:
$ likwid-perfctr -C S0:0-23 -g MEM -m ./mfcg 8000 20000
Does the hot spot loop run according to your Roofline prediction? Validate your model by checking your data traffic prediction.
If you have the time you can continue the analysis with the other important loops in the code. Does your code's performance scale from one socket to two?
Analysis, Optimization, and Validation
Can you think of an optimization that would improve the performance of the entire algorithm? Think about what the overall bottleneck is and how you can reduce the "pressure" on this bottleneck.
You don't have to go all the way with this. Implement something that you know will improve performance, and try to validate the effect with measurements.