EPEEC partners BSC and INESC-ID work on an OmpSs@OpenACC version of the ZPIC application

In the context of the EPEEC project, we are introducing support for tasks written in OpenACC or with the OpenMP target directives, to exploit heterogeneous environments with SMP cores and GPUs. So far, we have integrated our OmpSs-2 programming model [1] with the PGI compiler [2], and we can automatically generate parallel tasks for GPUs from directive annotations. The outcome of this effort has been presented as a master's thesis [3] in the MIRI master's program of the Facultat d’Informàtica de Barcelona at the Universitat Politècnica de Catalunya.


The OmpSs-2@OpenACC Programming Model

We have combined OmpSs-2 and OpenACC so that tasks annotated with OmpSs-2 can contain OpenACC directives. Such tasks are marked as targeting OpenACC, and their code is compiled with the PGI compiler to produce the binary for the target GPU. Figure 1 shows this compilation flow.

Figure 1: Compilation flow of OmpSs-2@OpenACC
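
To make the combination concrete, here is a minimal sketch of what an OmpSs-2 task containing OpenACC directives could look like. The kernel (saxpy_task) and its caller (scale_blocks) are hypothetical and not taken from ZPIC. The device(openacc) clause and the dependency syntax follow our reading of the OmpSs-2 documentation [1] and should be treated as assumptions. The sketch also assumes that data reaches the GPU through unified memory, so no explicit OpenACC data clauses appear.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (not from ZPIC): y = a * x + y.
 * Assumption: "device(openacc)" marks the task as targeting OpenACC, and
 * in/inout declare the data dependencies that the runtime tracks. */
#pragma oss task device(openacc) in(x[0;n]) inout(y[0;n])
void saxpy_task(size_t n, float a, const float *x, float *y)
{
    /* The OpenACC directive is what the OpenACC compiler turns into a GPU kernel. */
    #pragma acc parallel loop
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical caller: one task per block, so independent blocks
 * can be scheduled concurrently. */
void scale_blocks(size_t n_blocks, size_t block, float a,
                  const float *x, float *y)
{
    assert(block > 0);
    for (size_t b = 0; b < n_blocks; ++b)
        saxpy_task(block, a, x + b * block, y + b * block);

    /* Wait for all tasks created above before returning. */
    #pragma oss taskwait
}
```

A compiler that does not recognise these pragmas ignores them, so the same source also builds and runs sequentially on the host.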


The generated executable then runs on top of the Nanos6 and PGI runtimes. Nanos6 takes care of task creation, management, and scheduling at the SMP level. When tasks targeting OpenACC execute, they invoke the PGI runtime to run their kernels on the available GPUs. The Nanos6 runtime system supports task scheduling on multiple GPUs. Figure 2 shows the structure of the combined runtime systems.


Figure 2: Structure of the combined Nanos6 – OpenACC runtime



Using this environment, we have annotated the ZPIC application [4], provided by INESC-ID, with OmpSs-2@OpenACC. ZPIC is a 2D plasma simulator using the PIC (particle-in-cell) algorithm. ZPIC was initially parallelised using plain OmpSs-2. OpenACC directives and device tasks were then added to the most computationally expensive functions.

The evaluation was carried out on one node of the MareNostrum CTE-Power9 cluster, with two Power9 chips and four NVIDIA V100 GPUs [5]. The OmpSs-2 Mercurium compiler and the PGI compiler [2] were used to compile the OmpSs-2 code. Mercurium invoked the PGI compiler for the tasks targeting the OpenACC device, and the PGI compiler generated the GPU code from the OpenACC directives.

Figure 3 presents the speedup achieved with OmpSs@OpenACC in ZPIC running on up to 4 GPUs, with respect to plain OmpSs-2 running on SMP. The best performance is obtained with a single GPU, which achieves a speedup of 14.8 in the 16-region case.
With more than one GPU, performance degrades, and we traced this to two main causes. First, the unified memory used by the PGI runtime to move data to and from the device becomes a serialization point for multiple OpenACC tasks. This essentially removes the benefit of the task parallelism provided by OmpSs-2. This unnecessary serialization can be overcome by no longer relying on unified memory for transfers and moving that responsibility to the Nanos6 runtime. Second, the current OmpSs-2 scheduler does not take into account the data affinity of OpenACC tasks. It assigns the tasks to be executed to the available devices in round-robin order, which causes a large number of data transfers between devices. Devising a better OmpSs-2 scheduler that overcomes this issue is part of our future work.


Figure 3: Evaluation of ZPIC version with OmpSs@OpenACC
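
The effect of the second issue can be illustrated with a toy model. This is not the Nanos6 scheduler: the functions pick_round_robin, pick_affinity and count_transfers are hypothetical names introduced only for this illustration. The model assumes that each task touches exactly one region, and that a region has to be moved whenever the device chosen for a task differs from the device that last held that region.

```c
#include <assert.h>
#include <stddef.h>

enum { NO_DEVICE = -1 };  /* region not yet resident on any device */

/* Round-robin policy: task t runs on device t mod n_devices. */
int pick_round_robin(size_t task_index, int n_devices)
{
    assert(n_devices > 0);
    return (int)(task_index % (size_t)n_devices);
}

/* Affinity policy (assumed, for illustration): run the task where its
 * region already resides, and fall back to round-robin otherwise. */
int pick_affinity(int region_home, size_t task_index, int n_devices)
{
    if (region_home != NO_DEVICE)
        return region_home;
    return pick_round_robin(task_index, n_devices);
}

/* Count device-to-device moves. task_region[t] is the region used by
 * task t; region_home is scratch space with one entry per region.
 * The first placement of a region from the host is not counted. */
size_t count_transfers(const int *task_region, size_t n_tasks,
                       int *region_home, size_t n_regions,
                       int n_devices, int use_affinity)
{
    size_t transfers = 0;
    for (size_t r = 0; r < n_regions; ++r)
        region_home[r] = NO_DEVICE;

    for (size_t t = 0; t < n_tasks; ++t) {
        int r = task_region[t];
        assert(r >= 0 && (size_t)r < n_regions);
        int dev = use_affinity
                      ? pick_affinity(region_home[r], t, n_devices)
                      : pick_round_robin(t, n_devices);
        if (region_home[r] != NO_DEVICE && region_home[r] != dev)
            ++transfers;  /* region must move to another device */
        region_home[r] = dev;
    }
    return transfers;
}
```

For example, with three regions updated over two time steps (task_region = {0, 1, 2, 0, 1, 2}) on two devices, the round-robin policy moves every region once in the second step, giving 3 transfers, while the affinity policy gives 0. The numbers describe this toy model only and say nothing about the measured ZPIC results.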



We have annotated a version of the ZPIC application, provided by INESC-ID, with OmpSs@OpenACC, and we have demonstrated that we can achieve parallel execution on GPUs without recoding parts of the application in NVIDIA CUDA. The experiments show that a single GPU performs well. With more than one GPU, however, we still need to improve the annotations and the runtime support for both memory movement and task affinity.



[1] BSC. The OmpSs Programming Model. URL https://pm.bsc.es/ompss-2

[2] PGI. PGI version 19.1 Documentation for OpenPOWER and NVIDIA Processors.
URL https://www.pgroup.com/resources/docs/19.1/openpower/index.htm

[3] Orestis Korakitis. Towards supporting Composability of Directive-based Programming Models for Heterogeneous Computing. Master's thesis. URL https://epeec-project.eu/publications/towards-supporting-composability-directive-based-programming-models-heterogeneous.

[4] INESC-ID. ZPIC - OmpSs-2. URL https://github.com/nlg550/ZPIC_OmpSs2.

[5] NVIDIA Corporation. NVIDIA V100 Datasheet. URL