EPEEC application focus: assessing the power of tasks with data dependencies in OmpSs to optimize Particle-In-Cell simulation

Several applications are considered in EPEEC: AVBP (Cerfacs, a numerical simulation framework for the study of fluid dynamics and combustion problems), DIOGENeS (Inria, a numerical simulation framework for the study of nanoscale light-matter interaction problems), OSIRIS (INESC-ID, a numerical simulation framework for the study of plasma physics problems), Quantum ESPRESSO (Cineca, a set of numerical tools for the study of electronic properties of materials) and SMURFF (IMEC, a Bayesian matrix factorization framework for building recommender systems with applications to life sciences). This news item focuses on the OSIRIS application.

Despite the strong promise of the task-based paradigm, as adopted in parallel programming models such as OpenMP and OmpSs, its practical advantages are still far from well understood when it is applied to the non-trivial programs that make up real-world HPC applications.

In the context of EPEEC, a recent collaboration between the INESC-ID and BSC teams has contributed to a better assessment of the advantages and limitations of tasks with data dependencies when used to parallelize the important class of particle-mesh applications. The case used for our study is a plasma physics kinetic simulation, based on an electromagnetic particle-in-cell (EM-PIC) method. This method is widely used for modeling many relevant plasma physics scenarios, ranging from high-intensity laser-plasma interaction to astrophysical shocks [1]. 

This effort has resulted in two main contributions.

Suite of parallel implementations

First, different task-based implementations of ZPIC, a bare-bones version of the OSIRIS EM-PIC code [2], have been developed with the OmpSs-2 programming model. The versions exploit the task-based paradigm to different extents, ranging from its most basic features to advanced ones such as data dependencies. The suite of parallel implementations is available as open-source code in EPEEC's GitHub repository: https://github.com/epeec/zpic-epeec.

This suite of parallel implementations of ZPIC constitutes a useful benchmark to evaluate future advances in task-based programming tools and HPC hardware.

Experiments and results

As a second contribution, the different implementations were experimentally evaluated with two realistic simulation workloads, Laser Wakefield Accelerator (LWFA) and Collision of Plasma Clouds, on a shared-memory multicore system. The experiments were performed on a compute node with two Intel Xeon Platinum 8160 CPUs, each with 24 physical cores at 2.10 GHz (48 cores in total), and 96 GB of RAM, running SUSE Linux. The results show that a fully asynchronous implementation (i.e., one that uses only data dependencies for synchronization) achieves near-perfect scaling on 48 cores, despite the unbalanced workloads. This impressive result is accomplished while retaining the code simplicity of task-based programming.

Figure 1

Figure 1 summarizes the results of our evaluation. The performance of the conventional parallel-for approach (zpic-parallel-for) depends greatly on the simulation conditions, showing good scaling for the Weibel test case but very low speedups for the LWFA simulation. In contrast, all the task-based implementations perform consistently across test cases. Among them, zpic-tasklike has the lowest speedups, as this version relies on global barriers to synchronize the tasks. In zpic-reduction-sync, all global barriers are replaced with task dependencies, except for the barrier at the very end of each iteration, which allows the tasks within the same iteration to execute asynchronously. Finally, zpic-reduction-async synchronizes tasks only through data dependencies, enabling a truly asynchronous execution. With asynchronous execution, the OmpSs-2 runtime has greater flexibility to balance the load across threads, and synchronization occurs only between the tasks that require it, which reduces the synchronization overhead.
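To make the difference between barrier-based and dependency-based synchronization more concrete, the following is a minimal timing sketch in Python. It is not taken from ZPIC and does not use OmpSs-2. The function names, the nearest-neighbour dependency pattern, the one-worker-per-region assumption and the task costs are all hypothetical, chosen only for illustration.

```python
"""Toy timing model: global barriers vs. data-dependency synchronization.

All costs are made-up illustrative numbers, not measurements from ZPIC.
"""


def makespan_with_barriers(costs):
    """Total time when every step ends with a global barrier.

    costs[s][r] is the (hypothetical) run time of the task that updates
    region r in step s. One worker per region is assumed, so each step
    lasts as long as its slowest task.
    """
    return sum(max(step) for step in costs)


def makespan_with_dependencies(costs):
    """Total time when a task waits only for the tasks it depends on.

    Assumption: the task for region r in step s reads the results of
    regions r-1, r and r+1 from step s-1 (nearest neighbours only).
    """
    n_regions = len(costs[0])
    finish = [0] * n_regions  # finish time of each region's latest task
    for step in costs:
        new_finish = []
        for r, cost in enumerate(step):
            lo, hi = max(0, r - 1), min(n_regions, r + 2)
            ready = max(finish[lo:hi])  # wait only for the neighbours
            new_finish.append(ready + cost)
        finish = new_finish
    return max(finish)


if __name__ == "__main__":
    # Hypothetical unbalanced workload: the expensive region moves
    # from one end of the domain to the other between steps.
    costs = [
        [4, 1, 1, 1],
        [1, 1, 1, 4],
    ]
    print("barriers:    ", makespan_with_barriers(costs))
    print("dependencies:", makespan_with_dependencies(costs))
```

For this made-up workload, the barrier model takes 8 time units and the dependency model takes 5, because the expensive task in the second step can start as soon as its own neighbours have finished. The sketch covers only the effect of synchronizing between the required tasks. It does not model the runtime's ability to rebalance tasks across threads.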

Conclusions

We developed and analyzed a set of task-based implementations of an EM-PIC simulator to contribute to a better understanding of the benefits and limitations of tasking models when applied to the broad class of particle-mesh codes.

Our results confirm that tasking, when used with recent data-dependency features, enables the runtime to dynamically schedule highly asynchronous tasks, attaining near-ideal scalability even with very irregular workloads. This impressive result is achieved while retaining the simplicity of the tasking model, thus providing the programmer with high coding productivity.

 

References

[1] Arber, T.D., Bennett, K., Brady, C.S., Lawrence-Douglas, A., Ramsay, M.G., Sircombe, N.J., Gillies, P., Evans, R.G., Schmitz, H., Bell, A.R., Ridgers, C.P.: Contemporary particle-in-cell approach to laser-plasma modelling. Plasma Physics and Controlled Fusion 57(11), 113001 (2015)

[2] Fonseca, R.A., Silva, L.O., Tsung, F.S., Decyk, V.K., Lu, W., Ren, C., Mori, W.B., Deng, S., Lee, S., Katsouleas, T., Adam, J.C.: Osiris: A three-dimensional, fully relativistic particle in cell code for modeling plasma based accelerators. In: Computational Science – ICCS 2002. pp. 342–351. Springer, Berlin, Heidelberg (2002)