## Measuring the divide operation/instruction throughput

We want to numerically integrate the function

f(x) = 4/(1+x^2)

from 0 to 1. The result should be an approximation to π, of course. You may use a very simple rectangular integration scheme that works by summing up areas of rectangles centered around x_i with a width of Δx and a height of f(x_i):

```c
int SLICES = 1000000000;
double delta_x = ....;

for (int i=0; i < SLICES; i++) {
  x = (i+0.5)*delta_x;
  sum += (4.0 / (1.0 + x * x));
}
Pi = sum * delta_x;
```

You can find (not quite complete) example programs in C and Fortran in the DIV folder. Make sure that your code actually computes an approximation to π, and report runtime and performance in MFlops/s as obtained on one core of the cluster. How many flops per cycle are performed?

- Assuming that the divide instruction dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the inverse throughput of a divide operation, i.e., the number of CPU cycles per divide?
- In the Makefile we have deactivated SIMD vectorization for the Intel compiler via the -no-vec option. If you delete this option, the compiler will use AVX (32-byte) SIMD instructions. How does the divide throughput change? Did you expect this result?
- Parallelize the code using OpenMP. Run it with different numbers of cores on one socket (if you know already how to do this). Does the performance scale with the number of cores? Did you expect this result?
- Can you modify the code so that it estimates the divide instruction *latency* instead of the inverse throughput?

Last modified: Tuesday, 6 November 2018, 10:57 PM