EPEEC application highlights

Several applications are considered in EPEEC: AVBP (Cerfacs, a numerical simulation framework for the study of fluid dynamics and combustion problems), DIOGENeS (Inria, a numerical simulation framework for the study of nanoscale light-matter interaction problems), OSIRIS (INESC-ID, a numerical simulation framework for the study of plasma physics problems), Quantum ESPRESSO (Cineca, a set of numerical tools for the study of electronic properties of materials) and SMURFF (IMEC, a Bayesian matrix factorization framework for building recommender systems with applications to life sciences). In this technical news piece, we highlight some of the main achievements in porting these applications to the programming models developed in EPEEC, i.e., OmpSs, OmpSs@Cluster, GASPI and hybrid models such as OmpSs+OpenACC, OmpSs+ArgoDSM and OmpSs+GASPI.

 

AVBP

The AVBP project started in 1993 on the initiative of Michael Rudgyard and Thilo Schönfeld, with the aim of building a modern software tool for Computational Fluid Dynamics (CFD) within Cerfacs offering high flexibility, efficiency and modularity. Since then, the project has grown rapidly and today, under the leadership of Thierry Poinsot, AVBP represents one of the most advanced CFD tools worldwide for the numerical simulation of unsteady turbulent reacting flows. It is widely used both for basic research and for applied research of industrial interest. AVBP uses an unstructured residual-distribution Taylor-Galerkin finite-element scheme to simulate combustion in compressible fluids; it executes in parallel in SPMD fashion using domain decomposition, communicating via MPI. A big challenge when running extremely large test cases is the load balancing and partitioning of the unstructured mesh in AVBP. In EPEEC, we developed TreePart, a hardware-aware online dynamic load balancer and partitioner for unstructured meshes that maps partitions optimally onto the hardware, greatly reducing load imbalance and improving communication through MPI-3 shared-memory halo exchanges. Aided by the hybrid OpenMP/OmpSs parallelisation, we plan to further enhance the intra-node performance. We show performance results of using TreePart (full application) and the hybrid parallelisation (mini-application) for the test cases shown in Figure 1.
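The MPI-3 shared-memory halo exchanges mentioned above replace intra-node messages with direct loads and stores on a window shared by all ranks of a node. The following minimal C sketch illustrates the basic pattern; it is not AVBP or TreePart code, and the halo size and ring-style neighbour are purely illustrative.

/* Minimal sketch: intra-node halo exchange via an MPI-3 shared-memory window. */
#include <mpi.h>
#include <stdio.h>

#define HALO_SIZE 4   /* hypothetical number of halo values per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that share a node into one communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int nrank, nsize;
    MPI_Comm_rank(node_comm, &nrank);
    MPI_Comm_size(node_comm, &nsize);

    /* Each rank exposes its halo buffer in a node-wide shared window. */
    double *my_halo;
    MPI_Win win;
    MPI_Win_allocate_shared(HALO_SIZE * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &my_halo, &win);
    for (int i = 0; i < HALO_SIZE; i++)
        my_halo[i] = 100.0 * nrank + i;         /* rank-tagged values */

    /* Neighbours on the same node read each other's halos directly,
     * instead of exchanging explicit MPI messages. */
    MPI_Win_fence(0, win);
    int neighbour = (nrank + 1) % nsize;        /* toy 1D ring neighbour */
    MPI_Aint rsize;
    int rdisp;
    double *their_halo;
    MPI_Win_shared_query(win, neighbour, &rsize, &rdisp, &their_halo);
    double first = their_halo[0];               /* plain load, no message */
    MPI_Win_fence(0, win);

    printf("node rank %d read %.1f from neighbour %d\n", nrank, first, neighbour);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}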

Figure 1. Left: views of the use cases, simple burner (left-top) and Explo (left-bottom). Middle: TreePart improvements to the full application. Right: OpenMP scaling results for the mini-application.

Quantum ESPRESSO
Quantum ESPRESSO (QE) is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on a quantum mechanical modeling method called density-functional theory, which allows properties to be calculated accurately but at relatively low computational cost. The main QE program is written in Fortran 90, has been highly optimized, and is parallelized with MPI and OpenMP. There is also a CUDA Fortran version for Nvidia GPUs, but in EPEEC we have considered only the hybrid MPI+OpenMP code. The time-consuming kernels in the program are well known and involve the use of linear algebra (e.g., matrix diagonalizations) and the Fast Fourier Transform (FFT) algorithm. Since these kernels rely on efficient libraries being installed on the target architecture, QE provides two mini-applications, LAXlib and FFTXlib, to allow the testing of these kernels without having to run the full application. In EPEEC we decided to focus on the FFTXlib mini-application, since it was suggested that there would be more scope for performance improvement in the FFT kernel than in the linear algebra kernel, which depends strongly on the quality of the external linear algebra library. During the project, various optimization efforts were applied to the FFTXlib mini-application, but the most successful involved the MPI+OmpSs model. Here OmpSs directives were used to annotate the main loop of the mini-application so that each loop iteration becomes an OmpSs task, with data-dependency clauses ensuring that the tasks are executed in the correct sequence. In Figure 2, we show the results of the MPI+OmpSs optimization of the FFTXlib mini-application compared to MPI-only and hybrid MPI+OpenMP parallelization on the Karolina supercomputer at IT4I. It can be seen that MPI+OmpSs gives significantly better performance than MPI alone or hybrid MPI+OpenMP on this architecture. The next step will be to repeat the analysis for larger input systems and the whole QE application.
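As an illustration of this tasking pattern, the following minimal sketch is written in C with OmpSs-2 style directives (the real FFTXlib code is Fortran, and the kernels and sizes below are placeholder assumptions). Per-iteration tasks with dependency clauses keep the stages of each band in order while independent bands overlap.

/* Minimal sketch of the tasking pattern described above (not FFTXlib code). */
#define NBANDS 64     /* illustrative number of bands/loop iterations */
#define N      4096   /* illustrative transform size                  */

/* Placeholder kernels standing in for the real FFT stages. */
static void fft_forward(double *band)     { (void)band; /* forward FFT of one band   */ }
static void apply_potential(double *band) { (void)band; /* apply potential in real space */ }
static void fft_backward(double *band)    { (void)band; /* inverse FFT of one band   */ }

void process_bands(double *psi[NBANDS])
{
    for (int ib = 0; ib < NBANDS; ib++) {
        double *band = psi[ib];

        /* Each stage of each loop iteration becomes a task; the inout
         * dependency on the band's buffer keeps the three stages of one
         * band in order, while different bands may run concurrently. */
        #pragma oss task inout(band[0;N])
        fft_forward(band);

        #pragma oss task inout(band[0;N])
        apply_potential(band);

        #pragma oss task inout(band[0;N])
        fft_backward(band);
    }
    #pragma oss taskwait   /* wait before the results are used */
}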

 

Figure 2. Wall time as a function of the number of nodes on Karolina for MPI-only, MPI+OpenMP and MPI+OmpSs parallelization of the QE FFTXlib mini-application with the water benchmark. In each run either a single socket or a complete dual-socket Karolina node was used, i.e., 64 or 128 MPI tasks (MPI), 32 MPI tasks + 4 OpenMP threads (MPI+OpenMP) or just 32 MPI tasks in the case of OmpSs. Due to lack of memory, OmpSs results are not available for fewer than 2 nodes.

 

OSIRIS/ZPIC
OSIRIS is a fully relativistic, massively parallel particle-in-cell (PIC) code used to simulate highly nonlinear and kinetic processes that occur in plasma physics. Under the action of extreme intensities, the collective behaviour and the nonlinearities in plasmas play a critical role and determine the dynamics of a wide variety of complex laboratory and astrophysical scenarios. In a PIC code, the full set of Maxwell equations is solved on a grid using currents and charge densities calculated by weighting discrete particles onto the grid. Each particle is pushed to a new position and momentum via self-consistently calculated fields. For the EPEEC project, a purely sequential, bare-bones PIC code called ZPIC has been considered. ZPIC implements exactly the same algorithm as OSIRIS and maintains all the main features of the latter, enabling an easier exploration of the different parallel programming paradigms within EPEEC. ZPIC has been ported to multicore CPU systems using both OpenMP and OmpSs; to distributed-memory systems using hybrid implementations combining MPI and GASPI with OmpSs; to GPU accelerators using both pure OpenACC and hybrid OmpSs+OpenACC; and to FPGA accelerators using OmpSs+OpenCL. Figure 3 illustrates the main results from this effort on one problem instance. While the different programming technologies developed under EPEEC have been fully validated, perhaps the most striking conclusion is that asynchronous operation using tasks with data dependencies provides excellent scalability with high coding productivity.
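The following minimal C sketch conveys the flavour of that task-based approach; it is not actual ZPIC code, guard-cell exchanges between regions are omitted, and the region count, grid size and kernels are illustrative assumptions. Each spatial region advances through the time step as a chain of tasks ordered purely by data dependencies, with no per-stage global barrier.

/* Minimal sketch of a task-based PIC time step over spatial regions. */
#define NREG 16     /* illustrative number of regions on one node */
#define NX   4096   /* illustrative grid points per region        */

static double efield[NREG][NX], bfield[NREG][NX], current[NREG][NX];

/* Placeholder kernels standing in for the real PIC stages. */
static void push_and_deposit(double *e, double *b, double *j)
{ (void)e; (void)b; (void)j; /* particle push + current deposition */ }
static void update_fields(double *e, double *b, const double *j)
{ (void)e; (void)b; (void)j; /* field advance from the deposited current */ }

void pic_step(void)
{
    for (int i = 0; i < NREG; i++) {
        /* Push this region's particles with the local fields and deposit
         * the resulting current. */
        #pragma oss task in(efield[i]) in(bfield[i]) out(current[i])
        push_and_deposit(efield[i], bfield[i], current[i]);

        /* Advance the fields as soon as this region's current is ready;
         * the dependency chain orders the tasks without global barriers. */
        #pragma oss task in(current[i]) inout(efield[i]) inout(bfield[i])
        update_fields(efield[i], bfield[i], current[i]);
    }
    #pragma oss taskwait   /* single synchronization point per time step */
}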

Figure 3. Top-left: shared-memory scaling on a single MareNostrum 4 node using up to 48 threads for the OpenMP versions (red and green) and the OmpSs versions (orange and blue), showing near-perfect scaling of the fully asynchronous task-based version based on data dependencies. Top-right: the distributed-memory versions, OmpSs+MPI vs OmpSs+GASPI (2 processes per node, 64 threads per process, on Karolina), have good and very similar performance up to 32768 cores. Bottom: speedup over the single-core CPU version when using GPU (OmpSs+OpenACC) and FPGA (OmpSs+OpenCL) accelerators, with very good results for the GPU, but not so for the FPGA.

 

SMURFF
Bayesian Probabilistic Matrix Factorization (BPMF) is a powerful technique for recommender systems because it produces good results and is relatively robust against overfitting. Yet BPMF is more computationally intensive than non-Bayesian approaches and thus more challenging to implement for large datasets. That is why we developed SMURFF, a high-performance, feature-rich framework to compose and construct different Bayesian matrix-factorization methods. The framework has been successfully used in large-scale runs of compound-activity prediction. SMURFF is available as open source and can be used both on a supercomputer and on a desktop or laptop machine. In EPEEC we optimized two important use cases of SMURFF. The first use case, BPMF, is the main driver to test the multi-node programming models developed in EPEEC. We evaluated performance and productivity improvements for BPMF using GASPI, OmpSs@ArgoDSM and OmpSs@Cluster. The second use case, called Virtual Molecule Screening (VMS), efficiently implements the prediction part of SMURFF. This test case was used to develop GPU acceleration using OmpSs+OpenACC and an FPGA mapping using OmpSs@FPGA.
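For context, the prediction step that VMS accelerates is conceptually a large batch of dot products: a predicted compound-activity value is the inner product of the corresponding latent vectors, averaged over the posterior samples kept from training. The following minimal C sketch shows the idea; it is not SMURFF code, and the latent dimension and sample count are illustrative.

/* Minimal sketch of the BPMF prediction step (not SMURFF code). */
#define NUM_LATENT  16    /* illustrative latent dimension            */
#define NUM_SAMPLES 100   /* illustrative number of posterior samples */

/* U[s] and V[s] hold the latent vectors of one compound and one target
 * for posterior sample s. */
double predict(const double U[NUM_SAMPLES][NUM_LATENT],
               const double V[NUM_SAMPLES][NUM_LATENT])
{
    double acc = 0.0;
    for (int s = 0; s < NUM_SAMPLES; s++) {
        double dot = 0.0;
        for (int k = 0; k < NUM_LATENT; k++)
            dot += U[s][k] * V[s][k];       /* one sample's prediction */
        acc += dot;
    }
    return acc / NUM_SAMPLES;               /* posterior-mean prediction */
}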

Figure 4. Virtual Molecule Screening.

 

DIOGENeS
Developed by Inria, DIOGENeS is a software suite dedicated to computational nanophotonics, i.e., the numerical study of nanoscale light-matter interaction problems. This software suite integrates several variants of the Discontinuous Galerkin (DG) method, which is a blend of finite-element and finite-volume methods. A DG method relies on an arbitrarily high-order polynomial interpolation of the field unknowns within the cells of an unstructured mesh. Such a mesh can be locally adapted to the peculiarities of irregularly shaped structures, material interfaces with complex topography, geometrical singularities, etc. As a consequence, a DG method is particularly well adapted to dealing accurately and efficiently with the multiscale characteristics of nanoscale light-matter interaction problems. The numerical kernels of the DIOGENeS core library were initially adapted to high-performance computing through a classical SPMD strategy implemented with the MPI message-passing standard. The DGTD (Discontinuous Galerkin Time-Domain) solver considered in EPEEC is one of the simulators built on top of the DIOGENeS core library. This DGTD solver is a perfect candidate for a hybrid coarse-grain/fine-grain parallelization and, in the context of the EPEEC project, one achievement has been the implementation of this scenario, combining an SPMD strategy for inter-node parallelization using the MPI standard with a task-based intra-node parallelization using OmpSs.

The hybrid MPI+OmpSs parallelization of the new version of this DGTD solver has been used to simulate light propagation in a waveguide consisting of a chain of gold nanospheres (see Figure 5). A sample of performance figures for the strong-scalability assessment of the hybrid MPI+OmpSs parallelization is shown in Figure 5. The problem considered here is challenging from the parallel-scalability viewpoint because the presence of metallic, i.e., gold, nanospheres induces a computational load-balance issue. Indeed, modeling the response of metallic nanostructures at optical frequencies requires taking into account a set of ordinary differential equations, which are coupled to the system of Maxwell equations but are solved only in the mesh cells that discretize the nanospheres. In this context, a fine-grain task-based parallelization allows this computational load-balance issue to be mitigated to some extent.
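A minimal C sketch of how such a task-based intra-node layer can absorb this imbalance is given below. It is not DIOGENeS code: the per-block flag marking blocks that contain gold cells, the sizes and the kernels are hypothetical, and MPI is assumed to handle the coarse, subdomain-level parallelism outside this routine. Only blocks of cells discretizing gold nanospheres carry the extra ODE update, and the runtime scheduler spreads these heavier tasks over the idle cores of the node.

/* Minimal sketch of a fine-grain, task-based intra-node update. */
#define NBLOCKS 2048     /* illustrative number of cell blocks per subdomain */
#define BLKDOF  4096     /* illustrative field unknowns per block            */

static double field[NBLOCKS][BLKDOF];
static int    has_gold[NBLOCKS];   /* 1 if the block discretizes a gold nanosphere */

/* Placeholder kernels standing in for the real DG updates. */
static void dg_maxwell_update(double *blk) { (void)blk; /* DG update of Maxwell unknowns     */ }
static void drude_ode_update(double *blk)  { (void)blk; /* extra ODE update, gold cells only */ }

void intranode_update(void)
{
    for (int b = 0; b < NBLOCKS; b++) {
        /* One task per block: blocks with gold cells cost more, but the
         * runtime schedules ready tasks onto idle cores, so the extra work
         * no longer leaves most cores waiting behind the heaviest cells. */
        #pragma oss task inout(field[b])
        {
            dg_maxwell_update(field[b]);
            if (has_gold[b])
                drude_ode_update(field[b]);   /* dispersive-material ODEs */
        }
    }
    #pragma oss taskwait
}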

Figure 5. Propagation of light in a waveguide consisting of a chain of gold nanospheres. Left: unstructured tetrahedral mesh of the computational domain. Center: snapshot of the modulus of the electric field. Right: strong-scalability assessment of the hybrid MPI+OmpSs parallelization. DGTD-P2 refers to the DGTD solver with second-order polynomial interpolation of the electromagnetic field within each mesh cell, and simulations are performed on 2 MareNostrum 4 nodes (2 MPI processes) with 1 to 24 tasks. DGTD-P4 refers to the DGTD solver with fourth-order polynomial interpolation of the electromagnetic field within each mesh cell, and simulations are performed on 16 MareNostrum 4 nodes (16 MPI processes) with 1 to 24 tasks.