Exercise: Dense matrix-vector multiplication
Copy the source files to your home directory via
$ cp -a ~j75n0000/DMVM ~
Get an interactive job on the Emmy cluster with:
$ qsub -I -l nodes=1:ppn=40:f2.2:likwid -l walltime=08:00:00
Set up the environment:
$ module load intel64
$ module load likwid/4.3.4
There are C and F90 source codes available in the C and F90 folders, respectively. Build the executable with:
$ icc -Ofast -xhost -std=c99 -o ./dmvm ./dmvm.c
Test if it is working:
$ likwid-pin -c S0:2 ./dmvm 5000 5000
The DMVM folder contains a helper script ./bench.pl that scans the data set size. Use it as follows:
$ ./bench.pl F90/dmvm <N columns>
You can generate a png plot of the result with gnuplot (only available on the frontend machines) with:
$ gnuplot bench.plot
The plot script expects the benchmark output in bench.dat!
Performance Engineering cycle
- Code Analysis
- Develop performance expectation
- Performance profiling
What do we expect based on the static code analysis? What does this mean for benchmark planning?
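As a starting point for the static analysis, the kernel can be sketched as follows. This is only an assumed shape of the loop nest (matrix stored column by column, inner loop over rows) and is not copied from dmvm.c; the names dmvm_naive and dmvm_expected_flops are hypothetical. The second function turns a traffic count into an upper performance limit: each inner iteration does 2 flops and streams 8 bytes of the matrix, and if y no longer fits into the cache it additionally costs an 8-byte load and an 8-byte store.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed kernel shape (not taken from dmvm.c): y += A * x, with A stored
 * column by column so that the inner loop streams through one column. */
void dmvm_naive(size_t n_rows, size_t n_cols,
                const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t c = 0; c < n_cols; c++) {
        for (size_t r = 0; r < n_rows; r++) {
            y[r] += a[c * n_rows + r] * x[c];
        }
    }
}

/* Upper performance limit in flop/s for a data path that delivers
 * bandwidth_bytes_per_s. Per inner iteration (2 flops):
 *   y stays in cache:   8 bytes (one element of A)
 *   y does not fit:    24 bytes (A element + load and store of y[r]) */
double dmvm_expected_flops(double bandwidth_bytes_per_s, int y_in_cache)
{
    double bytes_per_iteration = y_in_cache ? 8.0 : 24.0;
    return 2.0 * bandwidth_bytes_per_s / bytes_per_iteration;
}
```

Under these assumptions the limit drops by a factor of three once y falls out of the cache in question, which suggests scanning the row count across the cache size boundaries.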
Set the number of columns to 10000 and scan the number of rows with (this should take less than two minutes):
$ ./bench.pl F90/dmvm 10000 > bench.dat
What do we learn from the result? Is this what we expected? How can we measure what is going on?
Instrument the source code with the LIKWID marker API.
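A minimal sketch of such an instrumentation is shown here. It uses the marker macros from likwid.h; the function name run_dmvm_marked, the region tag "dmvm" and the loop nest itself are assumptions for illustration and may differ from the provided sources. The fallback definitions let the same file compile without -DLIKWID_PERFMON.

```c
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid.h>
#else
/* Without -DLIKWID_PERFMON the markers compile to nothing. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical driver: runs the kernel `iterations` times inside one
 * marker region. Call it once per program run, because it also
 * initializes and closes the marker API. */
void run_dmvm_marked(size_t n_rows, size_t n_cols, int iterations,
                     const double *a, const double *x, double *y)
{
    LIKWID_MARKER_INIT;        /* once, before any region is started */
    LIKWID_MARKER_THREADINIT;  /* once per thread */

    LIKWID_MARKER_START("dmvm");
    for (int iter = 0; iter < iterations; iter++) {
        for (size_t c = 0; c < n_cols; c++) {
            for (size_t r = 0; r < n_rows; r++) {
                y[r] += a[c * n_rows + r] * x[c];
            }
        }
    }
    LIKWID_MARKER_STOP("dmvm");

    LIKWID_MARKER_CLOSE;       /* once, after the last region is stopped */
}
```

Keeping the start and stop calls outside the repetition loop means the region is entered only once, so the marker overhead does not distort short kernel runs.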
Build the new version with:
$ icc -Ofast -xhost -std=c99 -DLIKWID_PERFMON -o ./dmvm $LIKWID_INC ./dmvm-marker.c $LIKWID_LIB -llikwid
Test your new version using:
$ likwid-perfctr -C S0:3 -g MEM_DP -m ./dmvm 15000 10000
Repeat the scan of row count using the following command:
$ ./bench-perf.pl F90/dmvm 10000 MEM
What is the result? Repeat for L3 and L2 groups.
What do we learn from profiling?
Optimization and Validation
What can we do about the performance drops?
Plan and implement an optimization called spatial cache blocking. Make the target cache for which you block configurable.
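One possible shape of the blocked kernel is sketched here, assuming the same column-by-column matrix layout as an unblocked loop nest with the row loop innermost. The names dmvm_blocked and block_rows_for_cache are hypothetical, and sizing the block so that the chunk of y occupies about half of the target cache is a rule-of-thumb assumption, not a value prescribed by the exercise.

```c
#include <assert.h>
#include <stddef.h>

/* Row-blocked y += A * x: the rows are processed in chunks of block_rows,
 * so the active part of y is reused across all columns while it is still
 * in the target cache. */
void dmvm_blocked(size_t n_rows, size_t n_cols, size_t block_rows,
                  const double *a, const double *x, double *y)
{
    assert(block_rows > 0);
    for (size_t rb = 0; rb < n_rows; rb += block_rows) {
        size_t r_end = (rb + block_rows < n_rows) ? rb + block_rows : n_rows;
        for (size_t c = 0; c < n_cols; c++) {
            for (size_t r = rb; r < r_end; r++) {
                y[r] += a[c * n_rows + r] * x[c];
            }
        }
    }
}

/* Assumed heuristic: let the y chunk use roughly half of the target
 * cache, leaving the rest for the streamed matrix column and x. */
size_t block_rows_for_cache(size_t cache_bytes)
{
    size_t rows = cache_bytes / (2 * sizeof(double));
    return rows > 0 ? rows : 1;
}
```

Passing the cache size in bytes (for example from a command line argument) makes it easy to switch the blocking target between cache levels without recompiling.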
Repeat the benchmarking alone (without the -DLIKWID_PERFMON define), then validate the results with profiling.
Parallelize both the initial and the optimized version with OpenMP. Take care of the reduction on y!
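Two ways to handle y are sketched here, again assuming a column-by-column matrix layout; the function names are hypothetical. The first distributes the column loop and therefore needs a reduction on y, written with an array-section reduction, which requires OpenMP 4.5 support in the compiler. The second distributes row blocks, so every thread owns a disjoint part of y and no reduction is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Column-parallel variant: all threads update every y[r], so y must be
 * reduced. Each thread gets a private copy of y[0..n_rows-1] that is
 * summed into y at the end (OpenMP 4.5 array-section reduction). */
void dmvm_omp_columns(size_t n_rows, size_t n_cols,
                      const double *a, const double *x, double *y)
{
    assert(n_rows > 0);
    #pragma omp parallel for schedule(static) reduction(+ : y[0:n_rows])
    for (size_t c = 0; c < n_cols; c++) {
        for (size_t r = 0; r < n_rows; r++) {
            y[r] += a[c * n_rows + r] * x[c];
        }
    }
}

/* Row-block-parallel variant: each thread works on its own row blocks,
 * so the writes to y never overlap between threads. */
void dmvm_omp_row_blocks(size_t n_rows, size_t n_cols, size_t block_rows,
                         const double *a, const double *x, double *y)
{
    assert(block_rows > 0);
    #pragma omp parallel for schedule(static)
    for (size_t rb = 0; rb < n_rows; rb += block_rows) {
        size_t r_end = (rb + block_rows < n_rows) ? rb + block_rows : n_rows;
        for (size_t c = 0; c < n_cols; c++) {
            for (size_t r = rb; r < r_end; r++) {
                y[r] += a[c * n_rows + r] * x[c];
            }
        }
    }
}
```

The private copies in the first variant cost extra memory and a final summation per thread, which is worth keeping in mind when comparing the two versions in the scaling runs. Without an OpenMP compiler flag the pragmas are ignored and both functions run serially with the same result.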
Benchmark the results and scale out within one socket. What are the results?