In the EPEEC project we have integrated OpenACC tasks into the OmpSs-2 programming model. This lets us write applications with OmpSs-2 + OpenACC and execute the OpenACC code on GPUs more flexibly than with pure OpenACC: the OmpSs-2 runtime system manages data transfers and task scheduling across the GPU devices, whereas pure OpenACC relies on a fixed assignment of work to GPUs.
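As an illustration, the sketch below shows how such a task might be expressed. It is only a minimal sketch, assuming the device(openacc) task clause and array-section dependencies described in the OmpSs-2 documentation; the function and variable names are illustrative and not taken from ZPIC.

```c
// Minimal sketch (illustrative names): an OpenACC kernel wrapped in an
// OmpSs-2 task. The in/out dependencies let the OmpSs-2 runtime decide on
// which GPU the task runs and when its data must be moved, instead of the
// programmer hard-coding a device assignment.
void vec_add(const double *a, const double *b, double *c, int n)
{
    #pragma oss task device(openacc) in(a[0;n], b[0;n]) out(c[0;n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }
}
```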
To evaluate the OmpSs-2 + OpenACC implementation we used ZPIC [2], a simplified version of OSIRIS [1] that is part of the EPEEC applications suite. OSIRIS is a popular electromagnetic particle-in-cell (EM-PIC) application; ZPIC is a sequential, bare-bones EM-PIC code implementing the same base algorithm, which makes it easier to explore the different implementations considered. In our experiments we used two data sets as inputs to ZPIC:
- LWFA, where the load distribution is not uniform among devices.
- Weibel, where a large number of particles cross region boundaries and must be sent to other devices.
We have evaluated the OmpSs-2 + OpenACC approach on an IBM AC922 cluster based on POWER9 processors and NVIDIA Volta GPUs. Each node contains two 20-core POWER9 processors and four NVIDIA V100 GPUs with 16 GB of HBM2 each. A pair of GPUs is connected to each socket through NVLINK v2, with a measured bidirectional bandwidth of 143 GB/s. Communication across pairs of GPUs goes through the X-bus socket interconnect, with a measured bidirectional bandwidth of 38 GB/s.
Scaling experiments were also performed on a DGX-1 system, where the NVIDIA V100 GPUs are organized in a hybrid cube-mesh NVLINK v2 topology; hence, communication among more than two GPUs does not have to go through a slow socket interconnect as on the AC922. However, the DGX host is connected to the GPUs through PCIe, which is 4x slower than NVLINK, and GPU-to-GPU bandwidth over NVLINK is also lower on the DGX because each device has fewer links (2 versus 3 on the AC922, i.e. a 33% lower maximum theoretical bandwidth).
Recently, we performed an additional evaluation on a node of the Karolina supercomputer, equipped with two AMD EPYC 7763 processors and eight NVIDIA A100 GPUs (40 GB of HBM2e each). The GPUs are interconnected through NVLINK v3 and NVSwitches, which allow any two devices to communicate at the full NVLINK bandwidth of 600 GB/s; the GPUs are still connected to the host through PCIe.
OmpSs-2 + OpenACC is now part of the official OmpSs-2 release. The OpenACC runtime library and native compiler were provided by NVIDIA HPC SDK 20.11 (with CUDA 10.2). All versions were compiled with the -fast and -O3 flags, plus -ta=tesla:managed to run automatically on CUDA Unified Memory.
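As a minimal illustration (not taken from ZPIC) of what the managed option provides, the snippet below relies on CUDA Unified Memory to make an ordinary heap allocation visible to the GPU, so the kernel needs no explicit data clauses:

```c
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    /* With -ta=tesla:managed, dynamic allocations are placed in CUDA
       managed (unified) memory, so the device can access x directly. */
    double *x = malloc(n * sizeof(double));

    /* No copyin/copyout clauses are required; pages migrate between
       host and device on demand. */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        x[i] = 2.0 * i;

    free(x);
    return 0;
}
```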

Figure 1 shows strong scaling results for the LWFA and Weibel data sets on the two systems under evaluation (AC922 and DGX-1), for both the OmpSs-2 + OpenACC and the pure OpenACC implementations. The AC922 node scales very well from 1 to 2 GPUs on both input sets, but only LWFA continues to scale to 4 GPUs, due to the imbalance of computation in the Weibel input set. The good scalability on the AC922 comes from the direct communication provided by the NVLINK connection between the GPUs. It is also worth mentioning that, for the LWFA input running on 4 GPUs, OmpSs-2 + OpenACC outperforms pure OpenACC by 15%, achieving a speedup close to 3x.
Regarding the performance obtained on the DGX node, both input sets scale well from 1 to 4 GPUs, and OmpSs-2 + OpenACC shows an additional speedup of up to 10% over pure OpenACC. Profiling confirms that OmpSs-2 + OpenACC overlaps part of the communication with computation, in addition to providing a more flexible task scheduling environment that reduces data communication.
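The sketch below, with purely illustrative names and not taken from the ZPIC port, shows why such overlap is possible: tasks with disjoint dependencies can be scheduled on different GPUs, so the runtime may move one region's data while another region is being computed.

```c
// Two OmpSs-2 OpenACC tasks over disjoint halves of an array. Their
// dependencies do not overlap, so the runtime is free to run them on
// different devices and to overlap data movement with computation.
void update_field(double *field, int n, double dt)
{
    int half = n / 2;

    #pragma oss task device(openacc) inout(field[0;half])
    {
        #pragma acc parallel loop
        for (int i = 0; i < half; i++)
            field[i] += dt;
    }

    #pragma oss task device(openacc) inout(field[half;n-half])
    {
        #pragma acc parallel loop
        for (int i = half; i < n; i++)
            field[i] += dt;
    }

    #pragma oss taskwait
}
```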
Figure 2 shows the performance results for LWFA (strong scaling) and Weibel (weak scaling) on the Karolina supercomputer. Up to 4 GPUs, both implementations scale very well, attaining around 3x and 2.5x speedup with 4 devices for the LWFA and Weibel instances, respectively, with a slight edge for OmpSs-2 + OpenACC. Both approaches struggle to scale past 4 devices for the LWFA instance due to the non-uniform distribution of work and the small amount of computation per device. For the Weibel instance, both versions continue to scale well. We are currently investigating the cause of the lower performance of OmpSs-2 + OpenACC relative to pure OpenACC with 8 GPUs.
Figure 2: Evaluation of OpenACC and OmpSs-2 + OpenACC on the Karolina supercomputer.

Overall, we believe these results show the benefits of OmpSs-2 + OpenACC, both in easing the development of applications that can exploit the asynchronous features of the integration and in the increased performance obtained with our proposed environment.