Exercise: Dense matrix-vector multiplication


Copy the source files to your home directory with:

$ cp -a ~j75n0000/DMVM ~

Get an interactive job on the Emmy cluster with:

$ qsub -I -l nodes=1:ppn=40:f2.2:likwid  -l walltime=08:00:00

Set up the environment:

$ module load intel64
$ module load likwid/4.3.4


There are C and F90 source codes available in the C and F90 folders, respectively. Build the executable with:

$ icc -Ofast -xhost -std=c99 -o ./dmvm ./dmvm.c

Test if it is working:

$ likwid-pin -c S0:2 ./dmvm 5000 5000

There is a helper script ./bench.pl in the DMVM folder that lets you scan the data set size. Use it as follows:

$ ./bench.pl F90/dmvm <N columns>

You can generate a png plot of the result with gnuplot (only available on the frontend machines) with:

$ gnuplot bench.plot

The plot script expects the benchmark output in bench.dat!

Performance Engineering cycle

  1. Code Analysis
  2. Develop performance expectation
  3. Benchmarking
  4. Performance profiling
  5. Optimization
  6. Validation


What do we expect based on the static code analysis? What does this mean for benchmark planning?
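
As a starting point for the static analysis, the kernel and its data traffic can be sketched as below. This is a minimal sketch, not the code shipped in the C folder: it assumes column-major storage with the row loop innermost, and the names dmvm_naive, code_balance, n_rows and n_cols are hypothetical. The traffic model further assumes that x[c] stays in a register and that a is always streamed from memory.

```c
#include <stddef.h>

/* Hypothetical naive kernel: y += A * x.
 * Assumption: A is stored column-major, element (r, c) at a[c * n_rows + r],
 * and the row loop is the inner loop. */
void dmvm_naive(size_t n_rows, size_t n_cols,
                const double *a, const double *x, double *y)
{
    for (size_t c = 0; c < n_cols; c++) {
        for (size_t r = 0; r < n_rows; r++) {
            /* per iteration: 2 flops, one load of a,
             * one load and one store of y */
            y[r] += a[c * n_rows + r] * x[c];
        }
    }
}

/* Memory traffic in bytes per flop under the assumptions above.
 * a contributes 8 bytes per iteration. y contributes a load and a
 * write-back (16 bytes) only if it does not stay in cache between
 * two sweeps over the rows. */
double code_balance(int y_fits_in_cache)
{
    const double bytes_a = 8.0;
    const double bytes_y = y_fits_in_cache ? 0.0 : 16.0;
    const double flops = 2.0;
    return (bytes_a + bytes_y) / flops;
}
```

Under these assumptions the balance changes from 4 to 12 bytes per flop once y no longer fits into the cache, which suggests scanning the number of rows across the cache sizes.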

Set the number of columns to 10000 and scan the number of rows with (this should take less than two minutes):

$ ./bench.pl F90/dmvm 10000 > bench.dat

What do we learn from the result? Is this what we expected? How can we measure what is going on?

Performance profiling

Instrument the source code with the LIKWID marker API.
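
The instrumentation could look roughly like the sketch below. The LIKWID_MARKER_* macros are the ones provided by likwid.h when LIKWID_PERFMON is defined; the function name run_marked, the region name "dmvm" and the kernel layout (column-major, row loop innermost) are assumptions for illustration and may differ from dmvm-marker.c.

```c
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid.h>
#else
/* Empty fallbacks so the file also builds without -DLIKWID_PERFMON */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical driver: repeats y += A * x n_iter times inside one
 * marker region called "dmvm". In a full program, INIT and CLOSE
 * would normally sit at the start and end of main(). */
void run_marked(size_t n_rows, size_t n_cols,
                const double *a, const double *x, double *y, int n_iter)
{
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_START("dmvm");
    for (int iter = 0; iter < n_iter; iter++) {
        for (size_t c = 0; c < n_cols; c++) {
            for (size_t r = 0; r < n_rows; r++) {
                y[r] += a[c * n_rows + r] * x[c];
            }
        }
    }
    LIKWID_MARKER_STOP("dmvm");
    LIKWID_MARKER_CLOSE;
}
```

Keeping the timing loop inside a single region means the counters cover only the kernel and not the initialization of a, x and y.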

Build the new version with:

$ icc -Ofast -xhost -std=c99 -DLIKWID_PERFMON -o ./dmvm $LIKWID_INC ./dmvm-marker.c $LIKWID_LIB -llikwid

Test your new version using:

$ likwid-perfctr -C S0:3 -g MEM_DP -m ./dmvm 15000 10000

Repeat the scan of the row count using the following command:

$ ./bench-perf.pl F90/dmvm 10000 MEM

What is the result? Repeat for the L3 and L2 groups.

What do we learn from profiling?

Optimization and Validation

What can we do about the performance drops?

Plan and implement an optimization called spatial cache blocking. Make the target cache for which you block configurable.
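
One possible shape of a blocked kernel is sketched below. It assumes the same column-major layout with the row loop innermost; the names dmvm_blocked, block_rows and block_rows_for_cache are hypothetical, and the rule of giving half of the target cache to the y block is an assumption to tune, not a recommendation from the exercise.

```c
#include <stddef.h>

/* Hypothetical blocked kernel: y += A * x, processed in chunks of
 * block_rows rows so that the active part of y can stay in cache
 * while all columns are traversed. */
void dmvm_blocked(size_t n_rows, size_t n_cols, size_t block_rows,
                  const double *a, const double *x, double *y)
{
    if (block_rows == 0) {
        block_rows = n_rows;  /* 0 means: no blocking */
    }
    for (size_t rb = 0; rb < n_rows; rb += block_rows) {
        size_t r_end = rb + block_rows;
        if (r_end > n_rows) {
            r_end = n_rows;  /* last block may be shorter */
        }
        for (size_t c = 0; c < n_cols; c++) {
            for (size_t r = rb; r < r_end; r++) {
                y[r] += a[c * n_rows + r] * x[c];
            }
        }
    }
}

/* Assumption: reserve about half of the target cache for the y block
 * and leave the rest for the streamed matrix data and x. */
size_t block_rows_for_cache(size_t cache_bytes)
{
    size_t rows = cache_bytes / (2 * sizeof(double));
    return rows > 0 ? rows : 1;
}
```

Passing the cache size as a parameter (for example, the L2 or L3 size of the target machine) keeps the choice of target cache configurable without touching the kernel.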

Repeat the benchmark without profiling (do not set the -DLIKWID_PERFMON define), then validate the results with profiling.

Going parallel

Parallelize both the initial and the optimized version with OpenMP. Take care of the reduction on y!
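
Why y needs a reduction can be seen in the sketch below, which distributes the columns across threads: every thread then updates all rows of y, so each thread accumulates into a private copy that is summed afterwards. This is one possible approach under the assumed column-major layout, not the reference solution, and the name dmvm_omp_cols is hypothetical. Without OpenMP enabled (for icc, -qopenmp) the pragmas are ignored and the code runs serially.

```c
#include <stddef.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical column-parallel kernel: y += A * x.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int dmvm_omp_cols(size_t n_rows, size_t n_cols,
                  const double *a, const double *x, double *y)
{
    int n_threads = 1;
#ifdef _OPENMP
    n_threads = omp_get_max_threads();
#endif
    if (n_rows == 0) {
        return 0;
    }
    /* one zero-initialized private copy of y per thread */
    double *y_all = calloc((size_t)n_threads * n_rows, sizeof(double));
    if (y_all == NULL) {
        return -1;
    }
#pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        double *y_priv = y_all + (size_t)tid * n_rows;

        /* columns are split across threads; each thread touches all rows */
#pragma omp for
        for (size_t c = 0; c < n_cols; c++) {
            for (size_t r = 0; r < n_rows; r++) {
                y_priv[r] += a[c * n_rows + r] * x[c];
            }
        }
        /* implicit barrier above; now sum the private copies row-wise */
#pragma omp for
        for (size_t r = 0; r < n_rows; r++) {
            for (int t = 0; t < n_threads; t++) {
                y[r] += y_all[(size_t)t * n_rows + r];
            }
        }
    }
    free(y_all);
    return 0;
}
```

The OpenMP 4.5 array-section clause reduction(+:y[0:n_rows]) expresses the same idea more compactly. Distributing rows or row blocks instead of columns avoids the reduction entirely, because each thread then owns a disjoint part of y.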

Benchmark the results and scale out within one socket. What are the results?

Last modified: Tuesday, 8 October 2019, 11:44 AM