Type of publication
Publication in Conference Proceedings/Workshop
Authors

Collective operations are common features of parallel programming models that are frequently used in High-Performance Computing (HPC) and machine/deep learning (ML/DL) applications. In strong scaling scenarios, collective operations can negatively impact the overall application performance: with the increase in core count, the load per rank decreases, while the time spent in collective operations increases logarithmically. In this article, we propose a design for eventually consistent collectives suitable for ML/DL computations by reducing communication in Broadcast and Reduce, as well as by exploring the Stale Synchronous Parallel (SSP) synchronization model for the Allreduce collective. Moreover, we also enrich the GASPI ecosystem with frequently used classic/consistent collective operations, such as Allreduce for large messages and AlltoAll used in an HPC code. Our implementations show promising preliminary results with significant improvements, especially for Allreduce and AlltoAll, compared to the vendor-provided MPI alternatives.

Conference / Journal
2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Publisher
IEEE
Year of publication
2021
Place of publication
Portland, OR, USA
Citation

R. Iakymchuk et al., "Efficient and Eventually Consistent Collective Operations," 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2021, pp. 621-630, doi: 10.1109/IPDPSW52791.2021.00096.

DOI
10.1109/IPDPSW52791.2021.00096