Stale Synchronous Parallel (SSP) is a synchronization model proposed to speed up iterative convergent Machine Learning algorithms in distributed settings. In this model, synchronization among workers is reduced by allowing workers to see different intermediate solutions that can be a bounded number of iterations out of date (bounded staleness). With the advent of Remote Direct Memory Access (RDMA), one-sided communication has become a popular alternative to two-sided communication in asynchronous environments. Although SSP is inherently asynchronous, to the best of our knowledge no existing SSP solution uses one-sided communication. The goal of this thesis is to create a solution to SSP that takes advantage of RDMA's support for one-sided communication, and to provide it to application programmers through a new weakly consistent collective abstraction developed using the GASPI API. To this end, we designed and implemented two different solutions, ranging from directly adapting an existing synchronous allreduce algorithm to support SSP, to applying the ideas behind a Parameter Server architecture and running the Parameter Server shards directly on the nodes performing the collective. Our solutions were evaluated on the MareNostrum4 supercomputer, using up to 64 nodes, under two implementations of the Matrix Factorization algorithm: one our own, the other a real-world implementation. Using our proposed collective, we were able to reduce the collective execution time by up to 2.5x compared to MPI's allreduce, while having minimal impact on the convergence rate of the algorithms tested.
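The bounded-staleness rule at the heart of SSP can be sketched in a few lines. The following is a minimal illustration, not the thesis's implementation: a worker may start iteration t only if the slowest worker has finished at least iteration t - s, where s is the staleness bound. The names (`can_advance`, `clocks`) are hypothetical.

```python
def can_advance(next_iteration: int, all_clocks: list, staleness: int) -> bool:
    """SSP condition: a worker may start `next_iteration` only if it would
    not run more than `staleness` iterations ahead of the slowest worker."""
    return next_iteration - min(all_clocks) <= staleness

clocks = [5, 3, 4]  # current iteration of each worker
s = 2               # staleness bound

# Worker 0 wants to start iteration 6: the slowest worker is at 3,
# and 6 - 3 = 3 > s, so worker 0 must wait.
print(can_advance(clocks[0] + 1, clocks, s))  # False
# Worker 1 wants to start iteration 4: 4 - 3 = 1 <= s, so it may proceed.
print(can_advance(clocks[1] + 1, clocks, s))  # True
```

With s = 0 this degenerates to the fully synchronous (BSP) model, while larger s trades consistency of the intermediate solutions for less synchronization.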
Tom Vander Aa, Xiangju Qin, Paul Blomstedt, Roel Wuyts, Wilfried Verachtert, Samuel Kaski. A High-Performance Implementation of Bayesian Matrix Factorization with Limited Communication. International Conference on Computational Science (ICCS 2020).
Jaume Bosch, Carlos Álvarez, Daniel Jiménez-González, Xavier Martorell, Eduard Ayguadé. Asynchronous runtime with distributed manager for task-based programming models. Parallel Computing, Volume 97, 2020. https://doi.org/10.1016/j.parco.2020.102664
Gureya, D., Neto, J., Karimi, R., Barreto, J., Bhatotia, P., Quema, V., Rodrigues, R., Romano, P., Vlassov, V. Bandwidth-Aware Page Placement in NUMA. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, USA, 2020, pp. 546-556. DOI: 10.1109/IPDPS47924.2020.00063
Jaume Bosch et al. Breaking master-slave model between host and FPGAs. PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, USA, 22-26 February 2020. Association for Computing Machinery (ACM), New York, 2020, pp. 419-420. ISBN 978-1-4503-6818-6. DOI: 10.1145/3332466.3374545
Antonio J. Peña. EPEEC’s Advances toward Programming Productivity for Heterogeneity at Large Scale. EuroExaScale 2020 (HiPEAC 2020 Conference).
Pavanakumar Mohanamuraly and Gabriel Staffelbach. 2020. Hardware Locality-Aware Partitioning and Dynamic Load-Balancing of Unstructured Meshes for Large-Scale Scientific Applications. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’20). Association for Computing Machinery, New York, NY, USA, Article 7, 1–10. DOI:https://doi.org/10.1145/3394277.3401851
Orestis R. Korakitis. Towards supporting Composability of Directive-based Programming Models for Heterogeneous Computing. 2020.
With the increased effort to make supercomputers reach new levels of performance, it is crucial to investigate which programming models are the most effective in High-Performance Computing (HPC). Today, the most widely used frameworks in HPC are MPI, which follows a message-passing paradigm, and OpenMP, a shared-memory framework. Throughout the years, alternative frameworks have been emerging, such as GASPI and OmpSs. GASPI is an implementation of the partitioned global address space (PGAS) model, an approach to the distributed-shared-memory paradigm that uses one-sided communication. OmpSs works on shared memory using tasks with data dependencies. The goal of this thesis is to compare these programming models. To do so, we started from ZPIC, a sequential, open-source, simplified version of OSIRIS, a particle-in-cell code from the plasma simulation field. From ZPIC, we built different implementations using the above-mentioned emerging programming models. The different versions were experimentally evaluated using three real test cases that exercise different capabilities of ZPIC. We used the MareNostrum supercomputer, using up to 12,288 cores in our experiments. The results show that the strategy implemented with OmpSs achieved better results than the OpenMP versions, and that GASPI shows promising results.
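The task-with-data-dependencies idea behind OmpSs can be illustrated with a toy scheduler. This is a hedged sketch only, not OmpSs or the thesis's code: tasks declare which data they read ("in") and produce ("out"), and the runtime executes a task once all of its inputs are available. The task names loosely mirror particle-in-cell phases and are illustrative.

```python
def run_tasks(tasks):
    """Run (name, ins, outs, fn) tasks in an order that respects
    declared data dependencies; returns the execution order."""
    ready_data = set()        # data already produced
    pending = list(tasks)
    order = []
    while pending:
        for task in pending:
            name, ins, outs, fn = task
            if all(d in ready_data for d in ins):  # all inputs produced?
                fn()
                ready_data.update(outs)
                order.append(name)
                pending.remove(task)
                break
        else:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
    return order

order = run_tasks([
    ("advance", ["fields"],    ["particles"], lambda: None),  # push particles
    ("deposit", ["particles"], ["current"],   lambda: None),  # deposit current
    ("solve",   [],            ["fields"],    lambda: None),  # solve fields
])
print(order)  # ['solve', 'advance', 'deposit']
```

Even though the tasks are listed out of order, the dependency declarations alone determine a correct execution order, which is what lets a runtime like OmpSs extract parallelism from independent tasks without explicit synchronization.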