As high-performance systems evolve towards exascale and beyond, software development becomes increasingly complex due to the rapid growth in the number of cores available in a single system. Many alternative programming models have recently been proposed to ease software development while increasing application efficiency. The main objective of this thesis is to evaluate the limitations and advantages of one of these new programming models, OmpSs-2, when applied to a real, complex application. Our starting point is a sequential 2D plasma simulator (ZPIC) that has the same core algorithm and functionalities as OSIRIS. In the first part of this thesis, we parallelize ZPIC through spatial decomposition and target multicore CPUs. After applying dynamic load balancing, our implementation not only achieves near-perfect scaling on one node of MareNostrum4 but also shows very consistent performance across all simulations. In the second part of this thesis, we target GPUs using a combination of OmpSs-2 and OpenACC. To use the device architecture efficiently, we introduce major changes to ZPIC's algorithm, including sorting the particles by tiles, using shared memory as a cache, and restructuring the particle data for coalesced memory accesses. The final implementation, running on a single NVIDIA V100 GPU, achieves up to 20x the performance of two IBM Power9 processors. It also demonstrates excellent scaling on 2 GPUs and the potential to scale up to 4 accelerators.
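As a rough illustration of two of the GPU-oriented changes named above (sorting particles by tiles and storing particle data so that neighbouring threads read neighbouring memory), the following is a minimal sketch in C. It is not the thesis's implementation: the `ParticlesSoA` type, the `tile_of` and `sort_by_tile` functions, the square tile shape, and the choice of a counting sort are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical structure-of-arrays particle buffer (not ZPIC's actual type).
 * Each attribute lives in its own contiguous array, so consecutive threads
 * reading consecutive particles touch consecutive addresses. */
typedef struct {
    size_t n;   /* number of particles */
    int *ix;    /* cell index along x, assumed non-negative */
    int *iy;    /* cell index along y, assumed non-negative */
} ParticlesSoA;

/* Tile that owns cell (ix, iy), assuming square tiles of tile_size cells
 * laid out row by row with tiles_x tiles per row. */
static size_t tile_of(int ix, int iy, int tile_size, int tiles_x)
{
    return (size_t)(iy / tile_size) * (size_t)tiles_x
         + (size_t)(ix / tile_size);
}

/* Counting sort by tile (an assumed technique for this sketch).
 * On success, tile_start[t] .. tile_start[t + 1] is the range of sorted
 * slots belonging to tile t (tile_start has n_tiles + 1 entries), and
 * perm[k] is the original index of the particle placed in sorted slot k.
 * Returns 0 on success, -1 if the scratch allocation fails. */
static int sort_by_tile(const ParticlesSoA *p, int tile_size, int tiles_x,
                        size_t n_tiles, size_t *tile_start, size_t *perm)
{
    assert(tile_size > 0 && tiles_x > 0);

    size_t *cursor = calloc(n_tiles, sizeof *cursor);
    if (cursor == NULL)
        return -1;

    /* Pass 1: histogram of particles per tile, shifted by one slot. */
    for (size_t t = 0; t <= n_tiles; t++)
        tile_start[t] = 0;
    for (size_t i = 0; i < p->n; i++)
        tile_start[tile_of(p->ix[i], p->iy[i], tile_size, tiles_x) + 1]++;

    /* Pass 2: prefix sum turns the counts into starting offsets. */
    for (size_t t = 0; t < n_tiles; t++)
        tile_start[t + 1] += tile_start[t];

    /* Pass 3: place each particle in the next free slot of its tile. */
    for (size_t i = 0; i < p->n; i++) {
        size_t t = tile_of(p->ix[i], p->iy[i], tile_size, tiles_x);
        perm[tile_start[t] + cursor[t]++] = i;
    }

    free(cursor);
    return 0;
}
```

Once particles are grouped this way, all particles of a tile occupy one contiguous index range, which is what would let a GPU thread block stage that tile's field data in shared memory and process its particles together. The permutation would still have to be applied to every attribute array.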