The main components of the programming environment for exascale, carefully selected for the EPEEC project, are introduced in the figure below. Their new features, also shown in the figure, were identified as the developments needed to give the overall programming environment high programming productivity, high execution efficiency and scalability, energy awareness, and smooth composability and interoperability.
For a complete list of EPEEC software components visit the EPEEC GitHub.
Parallelware Analyzer (now rebranded as Codee) is a developer platform that, for the first time, makes it possible to shift performance left by providing automated code inspection specifically designed to improve the performance of software. The ever-increasing requirements of software projects demand these new capabilities, which are complementary to bug catching, compliance, and security checks. Codee provides a systematic, predictable approach to performance optimization that enables inexperienced developers to write faster code at the level of experts and alleviates the scarcity of qualified senior developers. The product is the first static code analyzer specializing in performance for C/C++/Fortran code. It provides a performance optimization report with human-readable, actionable items: opportunities, recommendations, defects, and remarks. It supports several Application Programming Interfaces (APIs) for parallel programming using compiler directives. It annotates CPU and GPU codes with OpenMP, OpenACC, and compiler-specific directives. It also detects defects in these directives, enabling the early detection of race conditions and data movement issues.
In the scope of the EPEEC project, we worked on the design of the Parallelware Analyzer command-line user interface to favor usability and improve the user experience. We also developed new capabilities for automatic parallel code generation using compiler directives for vectorization and multithreading, and we contributed to the EPEEC guidelines by extending the catalog of performance optimization best practices accordingly. The product is already at production level (TRL9) and has been deployed in leading HPC centers (e.g. NERSC, ORNL, BSC, KAUST, PAWSEY). In the future, Parallelware Analyzer will be developed further to provide more comprehensive support for the programming environments used in HPC.
Link to Parallelware: https://codee.com/
The BSC performance tools, namely Extrae and Paraver, are widely deployed and extensively used by institutions such as NCAR and NASA Ames, and in large projects such as the DEEP projects and the POP Centre of Excellence, to name only a few. All software components have been deployed on pre-exascale machines, including at least MareNostrum, and are therefore considered to be at TRL9. EPEEC has added GASPI and OpenACC support to these tools.
Link to Extrae: https://tools.bsc.es/extrae
Link to Paraver: https://tools.bsc.es/paraver
GASPI stands for Global Address Space Programming Interface and is a Partitioned Global Address Space (PGAS) API. It targets extreme scalability, high flexibility, and failure tolerance for parallel computing environments. GASPI aims to initiate a paradigm shift from bulk-synchronous, two-sided communication patterns towards an asynchronous communication and execution model. To that end, GASPI leverages remote completion and one-sided, RDMA-driven communication in a Partitioned Global Address Space. Asynchronous communication allows communication to be overlapped with computation. The main design idea of GASPI is to offer a lightweight API that ensures high performance, flexibility, and failure tolerance. GPI-2 is an open source implementation of the GASPI standard, freely available to application developers and researchers. It is already at production level (TRL9). However, in a fast-changing, innovative environment such as HPC, further developments are needed to accommodate hardware changes, such as the use of accelerators, and to improve the ease of use of GPI, which is currently an obstacle to wider adoption of the programming model.
In EPEEC, we worked on new collectives for the EPEEC applications and beyond. We built a compression library called COMPREX, which performs lossy compression by sparsification and local error accumulation for deep learning applications. We secured the composability of GASPI and OmpSs with the TAGASPI library, which was first implemented in the INTERTWinE project. Traditionally, GPI has been used in the visualisation and seismic imaging domains. In recent years, GPI has been ported to several scientific applications within publicly funded projects. In the scope of EPEEC, the most remarkable speedups have been seen in the SMURFF and QuantumEspresso applications.
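The idea of sparsification with local error accumulation can be illustrated with a minimal sketch. It assumes top-k selection by magnitude as the sparsification rule, which is one common choice and not necessarily what COMPREX implements; the function name and its parameters are hypothetical and do not come from the COMPREX API.

```python
def sparsify_with_error_feedback(gradient, residual, k):
    """Return (sparse_update, new_residual).

    sparse_update maps index -> value for the k largest-magnitude entries
    of (gradient + residual). All other entries stay in new_residual.
    Assumption: top-k selection; a real library may use another rule.
    """
    # Add the error left over from previous rounds before selecting.
    corrected = [g + r for g, r in zip(gradient, residual)]
    # Indices of the k entries with the largest magnitude.
    top = sorted(range(len(corrected)),
                 key=lambda i: abs(corrected[i]),
                 reverse=True)[:k]
    sparse_update = {i: corrected[i] for i in top}
    # Entries that were not sent are accumulated locally for the next round.
    new_residual = [0.0 if i in sparse_update else v
                    for i, v in enumerate(corrected)]
    return sparse_update, new_residual


if __name__ == "__main__":
    residual = [0.0, 0.0, 0.0, 0.0]
    gradients = [[0.5, -0.1, 0.2, 0.05], [0.0, -0.1, 0.2, 0.05]]
    for step, grad in enumerate(gradients):
        update, residual = sparsify_with_error_feedback(grad, residual, k=1)
        print(step, update, residual)
```

Only the entries in `sparse_update` would be communicated. The compression is lossy within a single round, but values that were dropped are carried forward in the residual, so small contributions can eventually be transmitted once they accumulate.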
Link to GASPI: https://github.com/epeec/GPI-2
The OmpSs programming model is implemented at BSC by means of the Mercurium compiler and the Nanos++ runtime system. A brand-new runtime system written from scratch, codenamed Nanos 6, is currently under development; it is intended to overcome code degradation and prevent future maintainability burdens. Both Mercurium and Nanos++ are currently at TRL9: they have been deployed in the Tier-0 PRACE facility MareNostrum, and a considerable number of known users in external institutions (e.g., CINECA, LRZ, JSC, Herta Security, and Vimar) seek support through the corresponding mailing list at BSC. Nanos 6 is mature enough to be used as the base for the new developments within EPEEC. The key features being added as part of the project are directive-based acceleration by means of OpenACC syntax in OmpSs (i.e., OmpSs@OpenACC), OpenMP offloading support, tasking in accelerators, and support for advanced Fortran/C++ features.
Link to OmpSs: https://github.com/epeec/nanos6
ArgoDSM is a modern page-based distributed shared virtual-memory system that operates in user space. It was first released in 2016 (the first publication describing ArgoDSM appeared in HPDC 2015). ArgoDSM is based on recent advances in cache coherence protocols and synchronisation algorithms at Uppsala University. It supports third-party network layers. ArgoDSM offers: (1) POSIX Threads compatibility (runs pthreads on clusters); (2) minimal effort to scale a typical pthreads program from one node to thousands of nodes; (3) user-space operation: ArgoDSM is a library on top of an RDMA-capable network layer; (4) release consistency (RC), and sequential consistency for data-race-free programs. ArgoDSM is available for research, evaluation, and education, and under a custom license for commercial use. It is distributed by Eta Scale, and its prototype implementations are being evaluated by commercial entities under real-world conditions. Its main role in EPEEC is to serve as an advanced back-end for the OmpSs@cluster programming model.
Link to ArgoDSM: https://github.com/epeec/ArgoDSM
BWAP (Bandwidth-Aware Page Placement) is a novel bandwidth-aware page placement mechanism for memory-intensive applications on NUMA systems. In contrast to the uniform interleaving policy offered by Linux, BWAP takes the asymmetric bandwidths of every node into account to determine and enforce an optimized, application-specific weighted interleaving.
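The difference between uniform and weighted interleaving can be shown with a small sketch. It assumes, purely for illustration, that each node's weight is proportional to its bandwidth; BWAP derives application-specific weights, and this is not its actual algorithm. The function names and bandwidth figures are hypothetical.

```python
def interleave_weights(bandwidths):
    """Map node -> share of pages, proportional to node bandwidth.

    Assumption: simple proportionality, used here only as an example.
    """
    total = sum(bandwidths.values())
    return {node: bw / total for node, bw in bandwidths.items()}


def place_pages(num_pages, weights):
    """Assign each page to the node that is furthest below its target share."""
    placed = {node: 0 for node in weights}
    placement = []
    for page in range(num_pages):
        # Target count after this page, minus pages already placed on the node.
        node = max(weights,
                   key=lambda n: weights[n] * (page + 1) - placed[n])
        placed[node] += 1
        placement.append(node)
    return placement


if __name__ == "__main__":
    # Hypothetical bandwidths (GB/s) seen by the application for two nodes.
    weights = interleave_weights({0: 30.0, 1: 10.0})
    print(place_pages(8, weights))
    # Uniform interleaving is the special case of equal weights.
    print(place_pages(8, {0: 0.5, 1: 0.5}))
```

With equal weights the pages are split evenly across nodes, which corresponds to uniform interleaving. With unequal weights the node with more bandwidth receives proportionally more pages.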
Ambix performs dynamic page placement for hybrid multi-threaded architectures. It extends general placement mechanisms in order to consider an architecture that integrates Intel Optane™ Persistent Memory. Ambix works with any platform where Intel Optane™ is configured in “App Direct Mode”.
ecoHMEM (Software Ecosystem for Heterogeneous Memory Management) is a software framework for automatic data placement in heterogeneous memory systems. It performs automatic data distribution at object allocation granularity to improve performance and enable more energy-efficient memory configurations in architectures incorporating several software-manageable memory tiers, such as systems equipped with Intel Optane Persistent Memory. It is currently composed of Extrae, HMem Advisor, and FlexMalloc. ecoHMEM was first publicly released by EPEEC.
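Placement at object allocation granularity can be sketched as a decision made once per allocation site, based on profiling data. The example below assumes a greedy heuristic that ranks sites by accesses per byte; this is an illustrative assumption, not the algorithm used by ecoHMEM. The `AllocationSite` type, the `assign_tiers` function, and the tier names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AllocationSite:
    name: str        # hypothetical label for the allocation call site
    size_bytes: int  # total bytes allocated from this site
    accesses: int    # profiled memory accesses attributed to this site


def assign_tiers(sites, fast_capacity_bytes):
    """Greedily place the most densely accessed objects in the fast tier.

    Assumption: accesses per byte is a reasonable proxy for benefit.
    """
    placement = {}
    remaining = fast_capacity_bytes
    ranked = sorted(sites,
                    key=lambda s: s.accesses / s.size_bytes,
                    reverse=True)
    for site in ranked:
        if site.size_bytes <= remaining:
            placement[site.name] = "fast"
            remaining -= site.size_bytes
        else:
            # Does not fit in what is left of the fast tier.
            placement[site.name] = "slow"
    return placement


if __name__ == "__main__":
    sites = [
        AllocationSite("A", 100, 1000),
        AllocationSite("B", 50, 100),
        AllocationSite("C", 80, 640),
    ]
    print(assign_tiers(sites, fast_capacity_bytes=150))
```

The resulting mapping would then have to be enforced at run time by an allocator that routes each allocation to its assigned tier.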