Publication in Conference Proceedings/Workshop

Toledo, L. [et al.]. Static Graphs for Coding Productivity in OpenACC. In: International Conference on High Performance Computing. "2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC): 17-20 Dec. 2021, Bengaluru, India: proceedings". Institute of Electrical and Electronics Engineers (IEEE), 2022, pp. 364-369. ISBN 978-1-6654-1016-8. DOI: 10.1109/HiPC53243.2021.00050.


DOI: 10.1109/HiPC53243.2021.00050
Article in journal

Toledo, L. [et al.]. Towards enhancing coding productivity for GPU programming using static graphs. "Electronics", 2022, vol. 11, no. 9, 1307.


DOI: 10.3390/electronics11091307
Publication in Conference Proceedings/Workshop

Orestis Korakitis, Simon Garcia De Gonzalo, Nicolas Guidotti, João Pedro Barreto, José C. Monteiro, and Antonio J. Peña. 2022. Towards OmpSs-2 and OpenACC interoperation. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '22). Association for Computing Machinery, New York, NY, USA, 433–434. DOI:https://doi.org/10.1145/3503221.3508401


DOI: 10.1145/3503221.3508401
Thesis/dissertation

Software distributed shared memory (DSM) systems have been one of the main areas of research in the high-performance computing community. One of the many implementations of such systems is Argo, a page-based, user-space DSM built on top of MPI. Researchers have dedicated considerable effort to making Argo easier to use and to alleviating some of the shortcomings that hurt its performance and scaling. However, several issues remain to be addressed, one of them being the simplistic distribution of pages across the nodes of a cluster. Since Argo works at page granularity, the placement of pages across the distributed system is of significant importance to performance, as it determines the extent of remote memory accesses. To ensure high performance, it is essential to employ memory allocation policies that place data in distributed memory modules intelligently, thus reducing latencies and increasing memory bandwidth. In this thesis, we incorporate several page placement policies into Argo and evaluate their impact on performance with a set of benchmarks ported to that programming model.
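
To give an intuition of what such a policy looks like, the minimal sketch below maps pages of a shared address space to home nodes in a block-cyclic fashion. It is a hypothetical illustration only: the names (home_node, BLOCK_PAGES) are not part of Argo's API, and the thesis evaluates richer policies implemented inside Argo itself.

/* Hypothetical block-cyclic page placement: not Argo code, just an
 * illustration of how a placement policy picks a page's home node. */
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE   4096UL
#define BLOCK_PAGES 16UL   /* contiguous pages assigned to a node before rotating */

/* Map a byte offset in the shared address space to its home node. */
static size_t home_node(size_t offset, size_t num_nodes)
{
    size_t page  = offset / PAGE_SIZE;
    size_t block = page / BLOCK_PAGES;
    return block % num_nodes;    /* rotate blocks of pages across the nodes */
}

int main(void)
{
    const size_t nodes = 4;
    for (size_t page = 0; page < 64; page += BLOCK_PAGES)
        printf("page %zu -> node %zu\n", page, home_node(page * PAGE_SIZE, nodes));
    return 0;
}

Larger blocks keep more spatial locality on one node, while smaller blocks spread traffic more evenly; that trade-off is exactly what the evaluated policies explore.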


Oral presentation

Learn about the new interoperation between two pragma-based programming models: OmpSs-2 and OpenACC. Two pragma-based programming models designed to function completely independently and unaware of each other can be made to collaborate effectively with minimal additional programming. We'll go over the separation of duties between the models and describe in depth the mechanism needed for interoperation. We'll provide concrete code examples using ZPIC, a 2D plasma simulator application written in OmpSs-2, OpenACC, and OmpSs-2 + OpenACC. We'll compare the performance and programmability benefits of the OmpSs-2 + OpenACC ZPIC implementation against the other single-model implementations. OmpSs-2 + OpenACC support is part of the latest OmpSs-2 release, and all ZPIC implementations are open source.
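
As a taste of what the interoperation looks like in source code, here is a minimal sketch of an OpenACC kernel wrapped as an OmpSs-2 task. It assumes the device(openacc) task clause introduced with this work; clause spellings and data-management details may differ from the released implementation, so treat it as illustrative rather than a verified build recipe.

/* Illustrative only: an OmpSs-2 task whose body is offloaded through OpenACC.
 * The OmpSs-2 runtime tracks the in/inout dependencies and schedules the task;
 * OpenACC generates and launches the GPU kernel asynchronously. */
#pragma oss task device(openacc) in(x[0;n]) inout(y[0;n])
void saxpy_task(int n, float a, const float *x, float *y)
{
    /* deviceptr assumes the runtime has already mapped x and y to the GPU */
    #pragma acc parallel loop deviceptr(x, y)
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

void run(int n, float a, const float *x, float *y)
{
    saxpy_task(n, a, x, y);   /* task creation; dependencies enforce ordering */
    #pragma oss taskwait      /* wait for all outstanding tasks to finish */
}

The key point is the separation of duties: OmpSs-2 owns dependency tracking and scheduling, while OpenACC owns kernel generation and launch.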


Oral presentation

We'll showcase the integration of CUDA Graph with OpenACC, which allows developers to write applications that benefit from GPU parallelism while increasing coding productivity. Since many scientific applications rely on high performance computing systems for their calculations, it's important to provide a mechanism that lets developers exploit the system's hardware to achieve the expected performance.

We'll also explore the most important technical details of the CUDA Graph and OpenACC integration, which lets programmers define their workflow as a set of GPU tasks, potentially executing more than one at the same time.

Examples will be provided using CUDA, C++, and OpenACC; registrants are expected to be familiar with at least the fundamentals of these programming languages.
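
For a flavour of the mechanism, the sketch below records OpenACC work submitted to an async queue into a CUDA graph via stream capture and then replays the instantiated graph. This is one plausible way to combine the two and not necessarily the exact integration presented in the session; it assumes the NVIDIA HPC compilers (nvc -acc) and a CUDA runtime with graph support.

/* Illustrative sketch: capturing OpenACC asynchronous kernels into a CUDA
 * graph through the CUDA stream that backs an OpenACC async queue. Assumes
 * a[] is already present on the device (e.g. via an enclosing acc data region). */
#include <openacc.h>
#include <cuda_runtime.h>

void run_iterations(float *a, int n, int iters)
{
    const int queue = 1;                                        /* OpenACC async queue */
    cudaStream_t s = (cudaStream_t)acc_get_cuda_stream(queue);  /* its CUDA stream */

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    /* Record one iteration's worth of OpenACC kernels into a graph. */
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    #pragma acc parallel loop present(a[0:n]) async(queue)
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * a[i] + 1.0f;
    cudaStreamEndCapture(s, &graph);

    /* CUDA 10/11 signature shown; CUDA 12 takes (&exec, graph, flags). */
    cudaGraphInstantiate(&graph_exec, graph, NULL, NULL, 0);

    /* Replay the captured work with a single launch per iteration. */
    for (int it = 0; it < iters; ++it)
        cudaGraphLaunch(graph_exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
}

The productivity gain is that the loop body stays plain OpenACC; only the capture, instantiation, and launch calls are CUDA-specific.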


Thesis/dissertation

The Intel Optane DC persistent memory module (DCPMM) is an emergent non-volatile memory (NVM) technology that is promising due to its byte-addressability, high density, and performance similar to DRAM. Prior literature explores a new architectural paradigm, coined hybrid memory architecture (HMA), which results from configuring NVM as a memory tier between DRAM and storage. HMAs have the potential to improve applications by enabling them to place a larger working set in fast memory, thus reducing the need to evict data to slow block-based storage. HMAs also mitigate the well-known memory scalability problem, common in a plethora of large servers and supercomputers. These systems cannot deploy more physical memory due to size, energy, or cost limitations, all of which are alleviated by NVM integration. However, most NVM research explores its non-volatility as an enabler of faster data persistence, neglecting the scalability benefit offered by NVM integration in the HMA scenario. Moreover, existing NVM research in the data placement field precedes the commercial availability of NVM, testing HMAs in often-inaccurate simulation-based environments and inferring NVM's performance from outdated technologies. This thesis proposes Ambix, the first published solution tested on a real system running DCPMM that decides page placement dynamically in a Linux system. We extensively discuss how different memory policies and distributions affect throughput and energy consumption in a DRAM-DCPMM system, leveraging the conclusions to guide Ambix's design. We show that Ambix delivers up to a 10x speedup on HPC-oriented benchmarks compared to the default memory policy in Linux.
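
To make the page-placement mechanism concrete, the following sketch demotes a set of application pages to a DCPMM-backed NUMA node with the Linux move_pages(2) syscall, the kind of primitive a dynamic tiering system such as Ambix builds on. It is not Ambix's actual code, and the node id (2 for the persistent-memory tier) is an assumption for this example.

/* Illustrative sketch: migrating pages between memory tiers exposed as NUMA
 * nodes (DRAM vs. DCPMM in KMEM/DAX mode). Not Ambix's implementation.
 * Build with: gcc demote.c -o demote -lnuma */
#include <numaif.h>    /* move_pages, MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>

#define DCPMM_NODE 2   /* assumed NUMA node id of the DCPMM tier on this machine */

/* Demote `count` pages (identified by their starting addresses) to DCPMM. */
static void demote_pages(void **pages, unsigned long count)
{
    int *nodes  = malloc(count * sizeof(int));
    int *status = malloc(count * sizeof(int));
    for (unsigned long i = 0; i < count; ++i)
        nodes[i] = DCPMM_NODE;

    /* pid 0 = calling process; MPOL_MF_MOVE migrates pages owned by it. */
    if (move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
    else
        printf("first page now resides on node %d\n", status[0]);

    free(nodes);
    free(status);
}

int main(void)
{
    void *buf = aligned_alloc(4096, 4096);
    *(char *)buf = 1;                  /* touch the page so it is actually backed */
    void *pages[1] = { buf };
    demote_pages(pages, 1);
    free(buf);
    return 0;
}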


Publication in Conference Proceedings/Workshop

R. Iakymchuk et al., "Efficient and Eventually Consistent Collective Operations," 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2021, pp. 621-630, doi: 10.1109/IPDPSW52791.2021.00096.


DOI: 10.1109/IPDPSW52791.2021.00096
Thesis/dissertation

The high-performance computing (HPC) industry is determinedly building towards next-generation exascale supercomputers. With this big leap in performance, the number of cores present in these future systems will be immense. Current state-of-the-art bulk synchronous two-sided communication models might not provide the massive performance and scalability required to exploit the power of these future systems. A paradigm shift towards an asynchronous communication and execution model to support the increasing number of nodes present in future supercomputers seems to be unavoidable. GASPI (Global Address Space Programming Interface) offers a Partitioned Global Address Space (PGAS) and allows for zero-copy data transfers that are completely asynchronous and one-sided, enabling a true overlap of communication and computation. Although promising, the PGAS model is still immature. Industrial-level HPC applications have yet to be developed with this model, which generates uncertainty about the model's effectiveness with real-world applications. The goal of this thesis is to contribute to a better understanding of the actual strengths and limitations of the GASPI programming model when applied to HPC applications that will benefit from future exascale systems. To achieve that, we focused on the parallelization of a representative method from the domain of plasma physics, the Particle-in-Cell (PIC) method. Starting from an existing sequential implementation (ZPIC), we evaluated the performance and programming productivity of GASPI when used to parallelize this implementation. After a thorough performance evaluation on the MareNostrum 4 supercomputer, we concluded that, while GASPI might fall behind the industry standard in terms of usability, it reliably outperformed an MPI implementation of the same application in both performance and scalability.
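
To ground the programming model, here is a minimal sketch of GASPI's one-sided, notified communication using the GPI-2 API. Segment ids, offsets, and sizes are made up for illustration, and error handling is reduced to a single macro; it is not code from the thesis.

/* Illustrative GASPI (GPI-2) example: rank 0 writes into rank 1's segment and
 * attaches a notification; rank 1 never posts a receive, it only waits on the
 * notification, which is what enables communication/computation overlap.
 * Build against GPI-2 and launch with gaspi_run. */
#include <GASPI.h>
#include <stdlib.h>

#define CHECK(call) do { if ((call) != GASPI_SUCCESS) exit(EXIT_FAILURE); } while (0)

int main(void)
{
    CHECK(gaspi_proc_init(GASPI_BLOCK));

    gaspi_rank_t rank, nprocs;
    CHECK(gaspi_proc_rank(&rank));
    CHECK(gaspi_proc_num(&nprocs));

    /* One globally visible segment per rank: the partitioned global address space. */
    const gaspi_segment_id_t seg = 0;
    CHECK(gaspi_segment_create(seg, 1 << 20, GASPI_GROUP_ALL,
                               GASPI_BLOCK, GASPI_MEM_INITIALIZED));

    if (rank == 0 && nprocs > 1) {
        /* Asynchronous, zero-copy, one-sided write plus notification. */
        CHECK(gaspi_write_notify(seg, 0 /* local offset */,
                                 1 /* target rank */, seg, 0 /* remote offset */,
                                 4096 /* bytes */, 0 /* notification id */,
                                 1 /* notification value */, 0 /* queue */,
                                 GASPI_BLOCK));
        CHECK(gaspi_wait(0 /* queue */, GASPI_BLOCK));   /* local completion only */
    } else if (rank == 1) {
        gaspi_notification_id_t first;
        gaspi_notification_t val;
        CHECK(gaspi_notify_waitsome(seg, 0, 1, &first, GASPI_BLOCK));
        CHECK(gaspi_notify_reset(seg, first, &val));     /* data has arrived */
    }

    CHECK(gaspi_proc_term(GASPI_BLOCK));
    return 0;
}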

