## Hands-On: Measuring the divide instruction/operation throughput

We want to calculate the value of $$\pi$$ by numerically integrating a function:

$$\displaystyle\pi=\int\limits_0^1\frac{4}{1+x^2}\,\mathrm dx$$

We use a very simple rectangular integration scheme that works by summing up the areas of rectangles centered around $$x_i$$ with a width of $$\Delta x$$ and a height of $$f(x_i)$$:

```c
int SLICES = 2000000000;
double delta_x = 1.0 / SLICES;
double x, sum = 0.0;
for (int i = 0; i < SLICES; i++) {
    x = (i + 0.5) * delta_x;
    sum += 4.0 / (1.0 + x * x);
}
double Pi = sum * delta_x;
```

You can find example programs in C and Fortran in the DIV folder.
For this exercise you need to employ the gcc compiler:
```
$ module load gcc
$ gcc -Ofast -march=skylake-avx512 -mprefer-vector-width=512 div.c -o div.exe
```
or:
```
$ gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 div.f90 -o div.exe
```
This compiles the code with the largest possible SIMD width on this CPU (512 bit).
Make sure that your code actually computes an approximation to π, and look at the runtime and performance in MFlops/s as obtained on one core of the cluster. How many flops per cycle are performed? (Hint: the clock speed is fixed at 3.0 GHz)
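Converting the measured performance to flops per cycle is a single division by the clock frequency; the performance figure below is a made-up placeholder, not a measurement:

$$\frac{\text{flops}}{\text{cycle}}=\frac{P}{f}\,,\qquad\text{e.g.}\quad\frac{750\ \text{MFlop/s}}{3000\ \text{MHz}}=0.25\ \frac{\text{flops}}{\text{cycle}}$$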

1. Assuming that the divide instruction dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the inverse throughput (i.e., the number of CPU cycles per operation) of a divide operation?
2. Now compile successively with the following architecture options:

```
-march=broadwell
-march=nehalem
-march=nehalem -fno-tree-vectorize
```
These produce AVX2, SSE, and scalar code, respectively.

How does the divide throughput change? Did you expect this result?
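One way to attack question 1 is the following sketch, valid only under the stated assumption that the divides dominate: each SIMD divide instruction processes $$W$$ doubles ($$W=8$$ for 512-bit, 4 for AVX2, 2 for SSE, 1 for scalar), so the loop issues $$\text{SLICES}/W$$ divide instructions in a runtime of $$t_\text{run}$$ seconds at a clock of $$f$$ cycles per second, giving an inverse throughput of

$$T_\text{div}\approx\frac{f\cdot t_\text{run}}{\text{SLICES}/W}=\frac{f\cdot t_\text{run}\cdot W}{\text{SLICES}}\ \frac{\text{cycles}}{\text{instruction}}$$

Comparing this number across the four code variants answers the final question: if $$T_\text{div}$$ per instruction stays roughly constant while $$W$$ shrinks, the throughput per cycle in elements drops accordingly.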