Computational Science Asked on April 18, 2021
I am running a CFD simulation with a 200,000-vertex mesh. I have decomposed the mesh into 2 load-balanced sub-domains to test my parallel implementation. In the specific function that I am time-profiling, each sub-domain has to send a 3D gradient vector of MPI_DOUBLEs for each of the ~2000 vertices that lie on the parallel communication boundary. The vertex list is sorted such that all vertices involved in parallel communication lie at the start of the list (ivertCommStart is the parallel vertex with the largest index in the list). The following is a simplified version of my code:
```
MPI_Request sendRequ; // Local variable in each thread
MPI_Request recvRequ; // Local variable in each thread

for (int ivert = 0; ivert < mvert; ++ivert)
{
    // Perform costly calculations for each vertex...

    if (ivert == ivertCommStart)
    {
        // Load data from ALL parallel vertices and communicate with the other thread in one go.
        MPI_Isend(..., ..., MPI_DOUBLE, (ithread == 0 ? 1 : 0), ..., ..., &sendRequ);
        MPI_Irecv(..., ..., MPI_DOUBLE, (ithread == 0 ? 1 : 0), ..., ..., &recvRequ);
    }
    else if (ivert > ivertCommStart)
    {
        MPI_Test(&sendRequ, ..., ...);
        MPI_Test(&recvRequ, ..., ...);
    }
}

MPI_Wait(&sendRequ, ...); // Send is still not completed at this point
MPI_Wait(&recvRequ, ...); // Recv is still not completed at this point
```
Considering that the total number of parallel communication vertices is ~4000 in each sub-domain (~4% of the total number of vertices), and considering the cost of the computations performed for each vertex, I would have expected the data transfer to be completely masked by the computations performed over the remaining non-parallel vertices. However, this is not the case. Note that after the MPI_Wait calls do eventually return, I have confirmed that the exchanged data is what I expect (parallel simulations produce results equivalent to a serial simulation). However, the cost of the MPI_Wait calls yields very poor scaling of my code. Can anyone advise me on why MPI_Test is not progressing my send and receive requests?
EDIT: Apologies for any confusion caused – I should clarify that I communicate the data for all parallel vertices in one go.
EDIT 2: I have found that the MPI_Test calls do not allow me to overlap the communication and computation, and they have a significant overhead. Effectively, the total time for the for loop plus the MPI_Wait calls is the same as if I wait until the end of the for loop and then communicate the parallel data using blocking send/recv calls (the code below takes the same amount of time to run as the code above). So I am seeing no benefit from the non-blocking communication. I am relatively new to MPI, so I would appreciate any advice on what could be going on here.
```
for (int ivert = 0; ivert < mvert; ++ivert)
{
    // Perform costly calculations for each vertex...
}

// Load data from ALL parallel vertices and communicate with the other thread.
MPI_Request sendRequ;
MPI_Request recvRequ;
MPI_Isend(..., ..., MPI_DOUBLE, (ithread == 0 ? 1 : 0), ..., ..., &sendRequ);
MPI_Irecv(..., ..., MPI_DOUBLE, (ithread == 0 ? 1 : 0), ..., ..., &recvRequ);
MPI_Wait(&sendRequ, ...);
MPI_Wait(&recvRequ, ...);
```
MPI_Test is a local operation. Do an MPI_Iprobe instead. It may also be that your MPI implementation lets you control asynchronous progress with environment variables. Intel MPI does that; I don't know about others.
Things may also depend on your network cards.
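A minimal sketch of the first suggestion, assuming the same loop structure as in the question; the function name and the buffer, count, rank and tag arguments (exchange_gradients, sendBuf, recvBuf, count, otherRank, tag) are placeholders for illustration, not names from the original code:

```
#include <mpi.h>

// Sketch: post the exchange once all boundary-vertex data is ready, then
// probe for the incoming message on every remaining iteration instead of
// testing the requests.
void exchange_gradients(double *sendBuf, double *recvBuf, int count,
                        int otherRank, int tag, MPI_Comm comm,
                        int ivertCommStart, int mvert)
{
    MPI_Request sendRequ, recvRequ;

    for (int ivert = 0; ivert < mvert; ++ivert)
    {
        /* ... costly per-vertex calculations ... */

        if (ivert == ivertCommStart)
        {
            // All boundary-vertex data is now loaded: post the exchange.
            MPI_Isend(sendBuf, count, MPI_DOUBLE, otherRank, tag, comm, &sendRequ);
            MPI_Irecv(recvBuf, count, MPI_DOUBLE, otherRank, tag, comm, &recvRequ);
        }
        else if (ivert > ivertCommStart)
        {
            // Probe for the incoming message; on many implementations this
            // also drives the progress engine for the pending send/receive.
            int flag;
            MPI_Iprobe(otherRank, tag, comm, &flag, MPI_STATUS_IGNORE);
        }
    }

    MPI_Wait(&sendRequ, MPI_STATUS_IGNORE);
    MPI_Wait(&recvRequ, MPI_STATUS_IGNORE);
}
```

For the second suggestion, Intel MPI exposes asynchronous progress through the I_MPI_ASYNC_PROGRESS environment variable (set before launching the job); other implementations have their own switches, so check your MPI's documentation.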
Answered by Victor Eijkhout on April 18, 2021