Amândio Faustino

Técnico Lisboa

Stale Synchronous Parallel (SSP) is a synchronization model proposed to speed up iterative convergent Machine Learning algorithms in distributed settings. In this model, synchronization among workers is reduced by allowing workers to see different intermediate solutions that may be out of date by a bounded number of iterations (bounded staleness). With the advent of Remote Direct Memory Access (RDMA), one-sided communication has become a popular alternative to two-sided communication in asynchronous environments. Although SSP is inherently asynchronous, to the best of our knowledge no existing SSP solution uses one-sided communication. The goal of this thesis is to create an SSP solution that takes advantage of RDMA's support for one-sided communication and to provide it to application programmers through a new weakly consistent collective abstraction developed using the GASPI API. To this end, we designed and implemented two solutions: one directly adapts an existing synchronous allreduce algorithm to support SSP, and the other applies the ideas behind a Parameter Server architecture, running the Parameter Server shards directly on the nodes performing the collective. Our solutions were evaluated on the MareNostrum4 supercomputer, using up to 64 nodes, with two implementations of the Matrix Factorization algorithm, one our own and the other a real-world implementation. Using our proposed collective, we were able to reduce the collective execution time by up to 2.5x compared to MPI's allreduce, while having minimal impact on the convergence rate of the algorithms tested.
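
The bounded staleness condition at the heart of SSP can be illustrated with a minimal Python sketch. It is not the GASPI collective developed in this thesis; it only models the rule that a worker may run ahead of the slowest worker by at most a fixed number of iterations. The names `SSPClock`, `can_advance`, `tick`, and the `staleness` parameter are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SSPClock:
    """Per-worker iteration counters under a staleness bound (illustrative only)."""
    num_workers: int
    staleness: int  # hypothetical bound: max iterations a worker may lead the slowest one
    clocks: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Every worker starts with zero completed iterations.
        self.clocks = [0] * self.num_workers

    def can_advance(self, worker: int) -> bool:
        # A worker may start another iteration only if its lead over the
        # slowest worker does not exceed the staleness bound.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def tick(self, worker: int) -> bool:
        # Complete one iteration if allowed; otherwise the worker must wait.
        if not self.can_advance(worker):
            return False
        self.clocks[worker] += 1
        return True


if __name__ == "__main__":
    clock = SSPClock(num_workers=2, staleness=1)
    for step in range(4):
        progressed = clock.tick(0)  # worker 0 keeps trying to run ahead
        print(f"attempt {step}: progressed={progressed}, clocks={clock.clocks}")
```

With a staleness of 0 the rule degenerates to fully synchronous execution, where no worker can start an iteration until all workers have finished the previous one; larger values trade freshness of the intermediate solutions for less waiting.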