MPI: data-rep - how to avoid replicated data
purpose of this exercise :
Now you write your first hybrid code (MPI + MPI 3.0 shared memory) and, thus, learn how to avoid data replication by using MPI 3.0 shared memory.
what the program is supposed to do :
For further computation all ranks need to know the same large array. In this exercise the array is filled with consecutive numbers. To check that the values were transferred (bcast) or stored/read (shared memory) correctly, we compute something whose result we know in advance, namely the sum of all array elements.
In the provided base code we use plain MPI only and are therefore forced to replicate the same array on all ranks, which wastes a lot of memory. Your task is to rewrite the code by means of MPI 3.0 shared memory, starting from the provided skeleton.
prepare for these exercises :
cd ~/HY-VSC/<PC#>/data-rep/C-data-rep # change into your C directory .OR. …
(Outside of the course please see slides for working directory and copying of files.)
use one of these programs as the baseline for your my_shared_exa3.c/_30.f90 :
data-rep_exercise.c # C
data-rep_exercise_30.f90 # Fortran
The exercises are also described on slide 133.
data-rep_base : data is replicated within one node
rank = 0 in MPI_COMM_WORLD broadcasts an array to all processes in MPI_COMM_WORLD. All ranks compute the sum of the elements of that array.
data-rep_exercise : avoiding replicated data within one node
To reduce memory consumption, the array is stored in a shared memory window only once per node (physical shared memory island). The array is broadcast to only one rank per node.
communicators :
- for the shared memory islands: comm_shm, rank_shm, size_shm (MPI_Comm_split_type with MPI_COMM_TYPE_SHARED)
- across nodes, comm_head including all ranks with rank_shm = 0 (MPI_Comm_split with color = 0 for all ranks with rank_shm = 0, and color = MPI_UNDEFINED for all other ranks)
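The two splits above can be sketched as follows. This is only an illustrative fragment, not the official solution; the variable names (comm_shm, comm_head, rank_shm, size_shm) follow the exercise text, and MPI_Init is assumed to have been called already:

```c
#include <mpi.h>

MPI_Comm comm_shm, comm_head;
int rank_world, rank_shm, size_shm, color;

MPI_Comm_rank(MPI_COMM_WORLD, &rank_world);

/* one comm_shm per physical shared memory island (node) */
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                    0, MPI_INFO_NULL, &comm_shm);
MPI_Comm_rank(comm_shm, &rank_shm);
MPI_Comm_size(comm_shm, &size_shm);

/* comm_head connects the rank_shm == 0 processes across the nodes;
   on all other ranks MPI_Comm_split returns MPI_COMM_NULL */
color = (rank_shm == 0) ? 0 : MPI_UNDEFINED;
MPI_Comm_split(MPI_COMM_WORLD, color, 0, &comm_head);
```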
shared memory :
- rank_shm = 0 allocates the shared memory in the size of the whole array (MPI_Win_allocate_shared)
- all other ranks specify length zero for their shared memory portion; the pointer returned to them is then not usable. Call MPI_Win_shared_query to obtain the starting address of the shared array. (Only when all ranks allocate non-zero-sized shared memory are the individual shared memory portions contiguous, so that the pointers to the portions of the other ranks can be computed from local information alone.)
remark: in Chapter 11 of the MPI course all processes reserved equally sized shared memory portions. For the exercise at hand this is not convenient and may also cause problems if the shared memory islands in the communicator have unequal numbers of processes.
hint: compare the slides: 123, 125 (MPI_Comm_split_type, MPI_Win_allocate_shared, direct assignments within MPI_Win_fences), 129 (MPI_Win_shared_query)
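The allocation and the query can be sketched like this. Again only an illustrative fragment: comm_shm and rank_shm are assumed from the communicator setup, and N (the array length) is a placeholder:

```c
#include <mpi.h>

MPI_Win win;
MPI_Aint win_size, sz;
int disp_unit;
double *arr;   /* will point to the start of the shared array on every rank */

/* only rank_shm == 0 allocates the full array; all others pass size 0 */
win_size = (rank_shm == 0) ? (MPI_Aint)N * sizeof(double) : 0;
MPI_Win_allocate_shared(win_size, sizeof(double), MPI_INFO_NULL,
                        comm_shm, &arr, &win);

/* ranks that allocated size 0 query the start address
   of the portion owned by rank 0 of comm_shm */
if (rank_shm != 0)
    MPI_Win_shared_query(win, 0, &sz, &disp_unit, &arr);
```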
broadcast between the heads :
- MPI_Bcast may be called only by ranks inside comm_head; for all other ranks comm_head is not defined (MPI_COMM_NULL)
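A minimal sketch of the guarded broadcast (arr, N, and comm_head as in the fragments above; the heads receive the data directly into their shared windows):

```c
#include <mpi.h>

/* only the heads of the shared memory islands participate */
if (comm_head != MPI_COMM_NULL)
    MPI_Bcast(arr, N, MPI_DOUBLE, 0, comm_head);
```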
use of the shared memory :
- take care where to put the memory fences
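One possible fence placement, sketched under the same assumptions as above: the first fence opens the write epoch in which the heads fill the shared array, the second one separates it from the read epoch in which all ranks compute the sum.

```c
#include <mpi.h>

MPI_Win_fence(0, win);
if (comm_head != MPI_COMM_NULL)                 /* heads write into the window */
    MPI_Bcast(arr, N, MPI_DOUBLE, 0, comm_head);
MPI_Win_fence(0, win);

double sum = 0.0;
for (int i = 0; i < N; i++)                     /* every rank reads */
    sum += arr[i];
```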
the output of the solution:
- the base pointers to the shared memory portions of the individual processes: all base pointers except those of the head processes of the shared memory islands point to (nil), because only the heads allocated shared memory of non-zero size.
- it reports the number of shared memory islands and the smallest and largest number of processes in one island.
- there is an "it" loop over the calculation of the sum. It stands for the iteration steps of a real-world application; the values of the array are modified between two iterations. Each printed line states the iteration it, the ranks w.r.t. the different communicators, and the sum.