Hands-On: A matrix-free CG solver

In this hands-on we analyze a conjugate-gradient (CG) linear solver which is derived from the popular HPCG benchmark. It is "matrix free" because it does not actually store the coefficient matrix; the matrix is hard-coded so that the sparse MVM kernel is actually a Jacobi-like stencil update scheme.

Preparation

Copy  the source files to your home directory via 

$ cp -a ~ghager/MFCG ~

Build and run

There  are C and F90 source codes available in the C/ and F90/ folders. Build the executable from C with:

$ icc -Ofast -xHost -std=c99 -o ./mfcg ./mfcg.c

In case of Fortran you need:

$ icc -c timingC.c
$ ifort -Ofast -xHost -o ./mfcg ./mfcg.f90 timingC.o

Test if it is working:

$ likwid-pin -c S0:2 ./mfcg  8000 20000

The problem size is specified as two numbers: The outer (first argument) and the inner dimension (second argument) of the grid. Performance-wise, this is important only for the stencil update part of the algorithm. The code implements a standard Conjugate-Gradient (CG) algorithm without a stored matrix, i.e., the matrix-vector multiplication is a stencil update sweep. Performance is printed in millions of lattice site updates per second. This number pertains to the whole algorithm, not just the stencil update.

Performance Engineering 

Time profile

Compile the code with the -pg switch to instrument with gprof (works fine with Fortran, too):

$ icc -Ofast -xhost -std=c99 -pg -o ./mfcg ./mfcg.c
After running the the code you end up with a gmon.out file, which can be converted to readable form by the gprof tool

$ gprof ./mfcg

The result is a "flat profile" and a "butterfly graph" of the application. look at the flat profile and think about what went wrong here. Fix it and do the profile again. 

What is the "hot spot" of the program?

-- END of part 1 ----------------------------------------------------------------------------------------------------------------

OpenMP parallelization

The code already has OpenMP directives. Add the "-qopenmp" option to the compiler command line to activate OpenMP. Run the code on the 24 physical cores of one VSC-4 socket. What is the performance on the full socket as compared to the serial version? Do you have a hypothesis about what the bottleneck might be?

Do and end-to-end performance counter measurement of the memory bandwidth of the program:

$ likwid-perfctr -C S0:0-23 -g MEM ./mfcg 8000 20000

Remembering what you learned about the memory bandwidth of a socket in the hands-on about the BWBENCH code, is your hypothesis confirmed?

Make a Roofline model of the hot spot loop you have identified above.

Performance profiling

Instrument the hot spot loop in the source code with the LIKWID marker API (be guided by the MiniMD or the BWBENCH code, which already contain markers). Note that the header file is called "likwid-marker.h" since LIKWID version 5. For threaded code, LIKWID_MARKER_START() must be called from every executing thread!
(Fortran programmers please look at https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerF90).

Build the new version with (you can omit the -DLIKWID_PERFMON for Fortran):

icc -Ofast -xhost -std=c99 -DLIKWID_PERFMON  -o ./mfcg $LIKWID_INC ./mfcg-marker.c  $LIKWID_LIB -llikwid

Test your new version using:

likwid-perfctr  -C S0:0-23 -g MEM -m ./mfcg 8000 20000

Does the hot spot loop run according to your Roofline prediction? Validate your model by checking your data traffic prediction.

If you have the time you can continue the analysis with the other important loops in the code. Does your code's performance scale from one socket to two?

Analysis, Optimization, and Validation

Can you think of an optimization that would improve the performance of the entire algorithm? Think about what the overall bottleneck is and how you can reduce the "pressure" on this bottleneck.

You don't have to go all the way with this. Implement something of which you know it will improve performance. Try to validate the effect with measurements.







Last modified: Friday, 12 March 2021, 2:26 PM