MPI: MPIX_Cart_weighted_create - application-aware Cartesian topology
purpose of this exercise:
This exercise benchmarks how the communication time depends on the decomposition of a volume and on the distribution of the processes on the hardware grid.
domain decomposition: assume a volume split into finite mesh cells in which, e.g., differential equations will be solved. The variables in each cell depend on the neighbouring cells. For the parallel computation, the volume is split into smaller blocks, one per process. The cells on the surface of each block then need data from the neighbouring processes, which is stored in halo cells around the block. Updating the halo cells means communication between the processes.
At this point I would like to distinguish the different decompositions we have. First, there are the finite mesh cells that come into play when solving the differential equations. This mesh is chosen independently of the distribution to the hardware that we are dealing with in this exercise. It is a kind of physical mesh, inherent to the problem we are solving. The objective of this exercise is to cut the whole volume into pieces and to distribute those in a smart way on the hardware, i.e., in terms of a Cartesian grid on the hardware. This decomposition can be done in a multi-level (_ml_) fashion, thus accounting for the different hardware levels: nodes, sockets, processes, hyperthreads, ...
aim of the decomposition and distribution to the hardware:
= minimisation of communication time.
1) node-node communication is much slower than intra-node communication.
2) a non-optimal decomposition can lead to flat blocks with a far larger surface and thus a larger total number of halo cells per process.
(1) MPI_Dims_create factorises the number of processes of the original communicator into dimensions that are as close as possible to a square/cube/hypercube, independent of the dimensions of the application data mesh. Step (1) may be non-optimal if the application data mesh is not close to a square/cube/hypercube (imagine a physically long stick or a thin plate placed on a cubic hardware grid).
(2) MPI_Cart_create (comm_old, ndims, dims, periods, reorder, comm_cart)
After step (2) the placement on the hardware may be non-optimal, since reordering via the parameter reorder is not implemented in many MPI libraries.
new approach: the MPIX routines
MPI_Dims_create ( ) + MPI_Cart_create ( ) → MPIX_Cart_weighted_create ( ... weights ....)
MPIX_Cart_weighted_create computes a multi-level factorisation of the communicator and re-numbers the ranks so that they are placed on the different hardware levels in a more favourable way.
If the communication bandwidth of the application is isotropic, MPIX_WEIGHTS_EQUAL fits well. If it is anisotropic, the respective weights can be given instead.
alternatively: MPI_Dims_create ( ) + MPI_Cart_create → MPIX_Dims_ml_create + MPIX_Cart_ml_create
prepare for these exercises:
cd ~/HY-VSC/<PC#>/MPIX_Cart_weighted_create # change into your C directory; sorry, this exercise is available in C only
(Outside of the course please see slides for working directory and copying of files.)
The exercises are described on slides 85-96.
the job scripts:
ls *.sh # OUTPUT: halo_skel_VSC.sh , halo_optim_VSC.sh
# sample job files, you'll have to adapt:
# jobname, number_of_nodes
# the line echo $SLURM_JOB_NODELIST in the job script prints the names of the nodes that have been used for the calculation. Where these nodes are located inside the InfiniBand network, i.e., how many levels have to be crossed, may also have an effect on the communication.
first: test the skeleton
# compile it, send it to the queue, analyse the results ( slurm-.....out )
mpiicc -o halo_skel.exe halo_irecv_send_toggle_3dim_grid_skel.c MPIX_*.c
then: use the MPIX_... routines
# copy the skeleton to a new file and optimise that one following the " TODOs "
# play around with the input file & analyse the results: see the slides
# for the input file compare slide 87
# data mesh size: the ratio between the numbers of mesh cells of the given volume in the different directions (not the distribution to the hardware)
cp halo_irecv_send_toggle_3dim_grid_skel.c halo_optim.c
mpiicc -o halo_optim.exe halo_optim.c MPIX_*.c
measure the communication time and compare it for the different cart_methods at the largest message size