In a previous news piece published in March 2020, we discussed why collective operations matter for scientific computations that rely on domain decomposition and, hence, on distributed computing. Collectives can be viewed as common communication patterns for exchanging or aggregating data among a group of processes. We also introduced the concept of eventual consistency, inherited from Grid computing.
In this post, we update the reader on our progress and present preliminary performance results. We propose a design for eventually consistent collectives suitable for ML/DL computations: it reduces communication in Broadcast and Reduce and explores the Stale Synchronous Parallel (SSP) synchronization model for the Allreduce collective.
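To make the SSP idea concrete, here is a minimal sketch of the usual SSP progress rule: a worker may run ahead of the slowest worker by at most a fixed number of iterations. The function name `can_proceed` and the parameter `slack` are hypothetical names chosen for illustration; this is the generic SSP condition, not the code of our Allreduce implementation.

```python
def can_proceed(my_clock, all_clocks, slack):
    """Generic SSP rule (illustrative only).

    A worker at iteration `my_clock` may continue without waiting as long as
    the slowest worker is at most `slack` iterations behind it.
    With slack == 0 this degenerates to fully synchronous (BSP) execution.
    """
    return my_clock - min(all_clocks) <= slack


# Worker at iteration 5, slowest worker at iteration 3, slack of 2: allowed.
print(can_proceed(5, [5, 3, 4], slack=2))
# Slowest worker at iteration 2: the fast worker has to wait.
print(can_proceed(5, [5, 2, 4], slack=2))
```

A larger slack lets fast workers keep computing with slightly stale data, which trades some consistency for less waiting.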
Fig. 1. Left: the communication pattern on a hypercube using eight processes. Middle: an example of the hypercube adapted to use SSP.
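The hypercube pattern for eight processes can be sketched as a small single-process simulation. In the standard hypercube (recursive doubling) scheme, a process with rank r exchanges data in step k with the process whose rank differs in bit k, that is r XOR 2^k. The function names below are hypothetical, and the sketch models only the consistent variant of the pattern, not the SSP adaptation and not actual GASPI communication.

```python
def hypercube_partners(rank, num_procs):
    """Return the partner of `rank` in each step of a hypercube exchange."""
    steps = num_procs.bit_length() - 1  # log2(num_procs) for powers of two
    return [rank ^ (1 << k) for k in range(steps)]


def simulate_hypercube_allreduce(values):
    """Simulate a sum-Allreduce over len(values) processes (a power of two)."""
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "number of processes must be a power of two"
    current = list(values)
    step = 1
    while step < p:
        # Every process adds the partial result of its partner in this dimension.
        current = [current[r] + current[r ^ step] for r in range(p)]
        step <<= 1
    return current


print(hypercube_partners(5, 8))                         # partners of rank 5
print(simulate_hypercube_allreduce([1, 2, 3, 4, 5, 6, 7, 8]))
```

With eight processes, three exchange steps are enough for every process to hold the global sum.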
Fig. 2. Experimental results showing the impact of allreduce_SSP on the convergence speed of an ML algorithm (Matrix Factorization using Stochastic Gradient Descent) on 32 nodes of the MareNostrum4 cluster.
We also enrich the GASPI ecosystem with frequently used classic (consistent) collective operations, such as Allreduce for large messages and the AlltoAll used in an HPC code.
Fig. 3. Performance results of Allreduce on SkyLake nodes at Fraunhofer: (left) on 32 nodes for various message sizes; (right) for vectors of 1,000,000 elements. gaspi corresponds to the segmented pipelined ring with GASPI (gaspi_allreduce_ring), while mpiX corresponds to one of 12 different MPI implementations from Intel MPI v18.1.
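For readers unfamiliar with ring-based Allreduce, the following single-process simulation sketches the textbook segmented ring: the vector is split into one segment per process, a reduce-scatter phase circulates partial sums around the ring, and an allgather phase circulates the finished segments. The function name `ring_allreduce` is hypothetical. The sketch shows only the data movement; it does not model pipelining or GASPI one-sided communication and is not the code of gaspi_allreduce_ring.

```python
def ring_allreduce(vectors):
    """Simulate a sum-Allreduce on a ring; vectors[r] is the input of rank r."""
    p = len(vectors)
    n = len(vectors[0])
    # Segment c covers indices [bounds[c][0], bounds[c][1]).
    bounds = [(c * n // p, (c + 1) * n // p) for c in range(p)]
    data = [list(v) for v in vectors]

    def exchange(step, first_segment_offset, combine):
        # Snapshot all outgoing segments first so the sends of one step
        # behave as if they happened simultaneously.
        outgoing = []
        for r in range(p):
            c = (r + first_segment_offset - step) % p
            lo, hi = bounds[c]
            outgoing.append((c, data[r][lo:hi]))
        for r in range(p):
            dst = (r + 1) % p  # right neighbour on the ring
            c, payload = outgoing[r]
            lo, _ = bounds[c]
            for i, x in enumerate(payload):
                data[dst][lo + i] = combine(data[dst][lo + i], x)

    # Phase 1, reduce-scatter: afterwards rank r owns the full sum of segment (r + 1) % p.
    for step in range(p - 1):
        exchange(step, 0, lambda mine, received: mine + received)
    # Phase 2, allgather: finished segments travel once around the ring.
    for step in range(p - 1):
        exchange(step, 1, lambda mine, received: received)
    return data


inputs = [[1, 2, 3, 4, 5, 6, 7], [10, 20, 30, 40, 50, 60, 70], [100, 200, 300, 400, 500, 600, 700]]
print(ring_allreduce(inputs)[0])
```

Each process sends and receives only one segment per step, which is why ring schemes are attractive for large messages.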
Fig. 4. Performance results of gaspi_alltoall compared against MPI on the Galileo cluster at CINECA; the GPI-2 installation is from the next branch on GitHub, which is v1.4.0, and the MPI implementation comes from the Intel MPI v18.0 library. Since we aim for a hybrid programming model implementation of CINECA’s FFT solver, which uses AlltoAll, we set four GASPI/MPI processes per node and run our experiments on 4, 8, and 16 nodes; these configurations are marked as gaspiX and mpiX.
Our implementations show promising preliminary results with significant improvements, especially for Allreduce and AlltoAll, compared to the vendor-provided MPI alternatives.
Our implementations of collectives are available at EPEEC’s GitHub repository: https://github.com/EPEEC