After reaching the limits of hardware integration and power dissipation, major chip vendors extended computer architectures with multi-core, many-core, and accelerator technologies. Adding multiple cores impacts programming: it introduces the need for parallelism and increases the complexity of programs. Accelerators complicate programming further because of their device-specific characteristics.
While incorporating several cores does not change the way each core is programmed, and developers can keep using their preferred programming language, programming accelerators usually requires additional functionality, such as offloading code to the accelerator and transferring the data needed for the computation back and forth.
Although accelerators can be designed and integrated into the architecture in many ways, two major trends probably cover more than 90% of the cases. Graphics Processing Units (GPUs) were originally exploited through graphics extensions that exposed new computing facilities to the programmer; they later evolved to support general-purpose programming, becoming GPGPUs. Even though they support general computation, they can most often only be programmed with languages developed specifically for them. The most common such languages are CUDA (www.nvidia.com/CUDA), OpenCL (www.khronos.org/OpenCL), and SYCL (www.khronos.org/sycl).
As an example of source code, consider our EPEEC-proposed OmpSs@OpenACC approach, in which we combine OmpSs and OpenACC directives into a single, compact source code, so that the programmer can see all the code at once and thus better understand and maintain it. Figure 1 shows an OmpSs@OpenACC source code sample for a vector addition algorithm.
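To illustrate the style of combined OmpSs and OpenACC directives, here is a minimal, hypothetical sketch of a vector addition task in that spirit (it is not the actual Figure 1; the exact directive clauses and function names are assumptions). Without an OmpSs/OpenACC toolchain the pragmas are simply ignored by an ordinary C compiler and the loop runs serially, which is enough to show the structure:

```c
#include <assert.h>

/* Hypothetical OmpSs@OpenACC-style outlined task: the OSS task directive
   annotates the function with its data dependences and target device,
   and the ACC directive parallelizes the loop body on the accelerator. */
#pragma oss task in(a[0;n], b[0;n]) out(c[0;n]) device(openacc)
void vec_add(const float *a, const float *b, float *c, int n)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void run_vec_add(const float *a, const float *b, float *c, int n)
{
    vec_add(a, b, c, n);     /* task creation (asynchronous under OmpSs) */
    #pragma oss taskwait     /* wait for the offloaded task to finish */
}
```

The point of the model is that data movement between host and device is not written by the programmer: the dependence annotations on the task are enough for the runtime to manage the transfers.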
Our technique for compiling these kinds of applications is shown in Figure 2. The source code is fed to the Mercurium compiler, which splits it into parts: the main program is compiled for the host processor (left side of Figure 2), and the device code goes to (possibly) a variety of device compilers (right side of Figure 2). In the OmpSs@OpenACC case, we use the PGI compiler to compile the device code and link it with the necessary libraries.
The same approach is used for OmpSs@CUDA, OmpSs@OpenCL, and OmpSs@FPGA. We also plan to work on OmpSs@SYCL and OmpSs@OpenMP-target flavors, to get the best of the various worlds.
As a comparison of the programming style and abstraction level obtained with OmpSs@OpenACC, Figure 3 shows the equivalent code in OmpSs@SYCL. As can be observed, the programmer still needs to prepare the data in specific buffers and build accessor variables so that the accelerator can access the data. With our approach in OmpSs, using the OpenACC and OpenMP-target backends, we achieve the goal of hiding these details within the compiler and the runtime system.
Table 1 shows a small evaluation of the number of additional lines each version requires compared to the serial version. As a reference, we assume the serial version is a for loop whose body consists of a single statement (3 lines). The SYCL version adds 12 lines of boilerplate to prepare the buffers, the accessors, and the kernel. In contrast, OmpSs@OpenACC shows its benefit by adding only a function call, to encapsulate the loop in a function, and 3 directives (OSS task, ACC parallel loop, and taskwait) to achieve the same purpose. We even expect that, in the future, we will be able to avoid the need to outline the task.
With the OmpSs@OpenACC approach we will be able to achieve higher productivity, while running efficiently on GPUs, in the development of applications in various fields, ranging from automotive to medical to space applications. In the EPEEC project, we are working with the applications AVBP (compressible reactive multiphase flow simulations), DIOGENeS (computational light/matter interaction problems), OSIRIS (particle-in-cell simulations), Quantum ESPRESSO (a solver for the Kohn-Sham equations), and SMURFF (machine learning).