In July 2020, we announced the development of OmpSs-2@OpenACC [1] by the BSC Accelerators and Communication team, using ZPIC [2], developed by INESC-ID, as our driving application. We have continued to improve the interoperation and co-design mechanisms by adding an affinity scheduler that minimizes device-to-device communication. We currently outperform an equivalent multi-GPU OpenMP+OpenACC version of ZPIC without requiring the programmer to manually manage asynchronous kernels or schedule kernels to specific devices.
OpenMP + OpenACC (Bulk Synchronous) vs OmpSs-2@OpenACC (Data Flow)
To avoid the complexity of programming multiple GPUs using only OpenACC, it is standard practice to combine it with OpenMP. This is done by having different OpenMP threads target different devices, with each thread invoking OpenACC kernels on its own GPU. However, this results in a bulk synchronous programming model in which kernels executing on different devices must be synchronized with OpenMP barriers. This can leave devices idle due to load imbalance, as devices that finish early must wait for the slowest kernel. OmpSs-2@OpenACC instead expresses a data flow parallel model where the synchronization of OpenACC kernels across devices is implicit: each kernel executes as soon as all of its data dependencies are met.
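To make the contrast concrete, the C sketch below shows a typical bulk-synchronous pattern, where one OpenMP thread drives each device and the end of the parallel region acts as a barrier, next to a data-flow version in which each partition becomes an OmpSs-2@OpenACC task released by its dependencies. The function names, the partitioning, and the exact OmpSs-2 pragma spelling are illustrative assumptions, not code taken from ZPIC.

```c
#include <openacc.h>
#include <omp.h>

/* Bulk-synchronous pattern: one OpenMP thread per GPU, each launching an
 * OpenACC kernel on its own device. The implicit barrier at the end of the
 * parallel region forces every GPU to wait for the slowest one. */
void advance_all_bulk(float *field, int n, int num_devices)
{
    int chunk = n / num_devices;

    #pragma omp parallel num_threads(num_devices)
    {
        int dev = omp_get_thread_num();
        acc_set_device_num(dev, acc_device_nvidia);   /* bind thread to GPU */

        float *part = field + dev * chunk;

        #pragma acc parallel loop copy(part[0:chunk])
        for (int i = 0; i < chunk; ++i)
            part[i] *= 2.0f;
    } /* implicit OpenMP barrier: all devices synchronize here */
}

/* Data-flow sketch: each partition is a task whose kernel runs as soon as its
 * data dependencies are satisfied; no explicit barriers or manual device
 * management appear in user code. Pragma spelling is illustrative. */
void advance_all_dataflow(float *field, int n, int num_partitions)
{
    int chunk = n / num_partitions;

    for (int p = 0; p < num_partitions; ++p) {
        float *part = field + p * chunk;

        #pragma oss task device(openacc) inout(part[0;chunk])
        {
            #pragma acc parallel loop
            for (int i = 0; i < chunk; ++i)
                part[i] *= 2.0f;
        }
    }

    #pragma oss taskwait /* wait only when the results are actually needed */
}
```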
Device Affinity
We had previously reported that the round-robin distribution of tasks across the GPUs and the Unified Memory subsystem [3] used for memory transfers were the main causes of poor performance as we increased the number of GPUs. We have addressed these issues by developing a device affinity scheduler and introducing two new runtime API calls:
- ompss_device_alloc(size, device_number)
- ompss_device_free(data_pointer)
The new API is used to allocate and free memory on a particular device. The device number used for allocation is logical rather than physical, so the same code works with an arbitrary number of GPUs. When a task is submitted to the scheduler to be assigned to a device, its data allocations and accesses are evaluated, and the device with the highest affinity is chosen.
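As an illustration, the hypothetical sketch below shows how the two calls might be used with logical device numbers so that the affinity scheduler can later place tasks near their data. Only the call names above come from the runtime; the prototypes, headers, and the round-robin assignment of regions to logical devices are assumptions.

```c
#include <stdlib.h>

/* Assumed prototypes for the new OmpSs-2 runtime calls; the actual return
 * type and header are not specified here. */
void *ompss_device_alloc(size_t size, int device_number);
void  ompss_device_free(void *data_pointer);

/* Allocate one region per logical device. With fewer physical GPUs the
 * runtime maps logical numbers onto the available devices, so the same
 * code still runs unchanged. */
void setup_regions(float **regions, int num_regions, size_t elems, int num_logical_devices)
{
    for (int r = 0; r < num_regions; ++r) {
        int logical_device = r % num_logical_devices;
        regions[r] = ompss_device_alloc(elems * sizeof(float), logical_device);
    }
}

void teardown_regions(float **regions, int num_regions)
{
    for (int r = 0; r < num_regions; ++r)
        ompss_device_free(regions[r]);
}
```

Tasks whose dependencies reference a given region would then be scheduled to the device with the highest affinity for that region, reducing device-to-device traffic.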
Experiments and Results
The evaluation results shown in the figure below were obtained on one node of the MareNostrum CTE-Power9 cluster. The node contains two 20-core IBM POWER9 processors and four NVIDIA V100 GPUs [4]. The GPUs are divided into pairs; within each pair the GPUs are connected using NVLink v2, while across pairs they communicate over the X-bus socket interconnect, which is 4x slower. OmpSs-2 v2.5 with the affinity scheduling additions was used for compilation, and NVIDIA HPC SDK 20.11 (with CUDA 10.2) provided the OpenACC support and native compiler. The baseline for all results is the single-GPU OpenACC execution.
The speedup over the OpenACC version is partly due to improved GPU utilization. The data flow model of OmpSs-2@OpenACC reduces idling and allows more kernels to execute concurrently, while the OpenACC version is limited to one kernel at a time. As the number of GPUs increases to three and four, the OmpSs-2@OpenACC version is also better able to hide the latency of the slow socket interconnect.
Conclusions
We have improved the performance of OmpSs-2@OpenACC by introducing a new affinity scheduler that takes advantage of two new memory allocation API calls. This approach will be merged into an upcoming OmpSs-2 release, benefiting all OmpSs-2 device tasks. The OmpSs-2@OpenACC-annotated version of the ZPIC application, provided by INESC-ID, achieves better performance on a multi-GPU system than an OpenMP+OpenACC version. For more information on this topic, please watch for our upcoming publication. We plan to further improve performance by moving the responsibility for memory movement from the Unified Memory subsystem to the OmpSs-2 runtime.
References
[1] EPEEC News. URL https://epeec-project.eu/media/news/epeec-partners-bsc-and-inesc-id-work-ompssopenacc-version-zpic-application
[2] INESC-ID. ZPIC - OmpSs-2. URL https://github.com/nlg550/ZPIC_OmpSs2
[3] NVIDIA Blog: Maximizing Unified Memory performance in CUDA. URL https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
[4] NVIDIA Corporation. NVIDIA V100 Datasheet. URL https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf