Hands-On: Dense matrix-vector multiplication

In this hands-on we will explore performance properties of dense matrix-vector multiplication: 

        for (int c=0; c<N_cols; c++) {
            for (int r=0; r<N_rows; r++) {
                y[r] = y[r] + a[c*N_rows+r] * x[c];
            }
        }


Preparation

Copy  the source files to your home directory via:

$ cp -a ~ghager/DMVM ~


Preparation

There are C and F90 source codes available in the C and F90 folders, respectively. Build the executable with:

$ icc -Ofast -xHost -qopt-zmm-usage=high -std=c99 -o ./dmvm ./dmvm.c

For Fortran you have to compile the timing code and link it to the program:

$ icc -c timingC.c
$ ifort -Ofast -xHost -qopt-zmm-usage=high -o ./dmvm ./dmvm.f90 timingC.o 

Test if it is working:

$ likwid-pin -c S0:2 ./dmvm  5000 5000

The output shows the number of repetitions, the problem size, and the performance in Mflop/s. There is a helper script ./bench.pl in the DMVM folder that that allows to scan data set size. Use it as follows (here using the compiled Fortran code as an example):

$ ./bench.pl F90/dmvm <N columns>

It keeps the number of columns constant (as given - we recommend something between 5000 and 10000) and scans the number of rows from 1000 to 100000. It stops if the overall working set is larger than 2 GB. If you pipe the output of bench.pl into the file bench.dat, you can generate a png plot of the result with gnuplot (only available on the frontend machines) with:

$ gnuplot bench.plot

Remember that the input data for this gnuplot script is expected in bench.dat!


    Benchmarking

    What do we expect based on the static code analysis? What does this mean for benchmark planning?

    Set the number of columns to 10000 and scan the number of rows with (this should take less than two minutes):

    ./bench.pl C/dmvm 10000 | tee bench.dat

    What do we learn from the result? Is this what we expected? How can we measure what is going on?

    Performance profiling

    Instrument the source code with the LIKWID marker API.

    Build the new version with:

    icc -Ofast -xhost -qopt-zmm-usage=high -std=c99 -DLIKWID_PERFMON  -o ./dmvm $LIKWID_INC ./dmvm-marker.c  $LIKWID_LIB -llikwid

    or:

    ifort -Ofast -xhost -qopt-zmm-usage=high -o ./dmvm $LIKWID_INC ./dmvm-marker.f90  $LIKWID_LIB -llikwid timingC.o

    Test your new version using:

    likwid-perfctr  -C S0:3 -g MEM_DP -m ./dmvm 15000 10000

    Repeat the scan of row count using the following command (the last option is a LIKWID metric group name):

    ./bench-perf.pl C/dmvm 10000  MEM

    The output of this has 3 columns: The number of matrix rows, the performance, and the number of bytes transferred to and from memory (if using the MEM group) per iteration. If you specify "L2" or "L3," you get the traffic to the L2 and L3 cache, respectively. 

    If you want you can modify the bench.plot script (change "1:2" on the last line to "1:3") to get plots of data traffic (bytes per iteration) versus number of rows. 

    What are your observations? Can you correlate the observed performance drops with traffic behavior? Which data structure is responsible for this?


    Optimization and Validation

    What can we do about the performance drops?

    Plan and implement an optimization called spatial cache blocking. Allow to configure for which target cache you block.

    Repeat benchmarking only (not setting the -DLIKWID_PERFMON define)  and validate the results with profiling.

    Going parallel

    Parallelize both the initial and optimized version with OpenMP. Take care for the reduction on y!

    Benchmark the results and scale out within one socket. What are the results?




    Last modified: Thursday, 11 March 2021, 2:38 PM