The EPEEC project has published initial software prototypes to support developers of distributed applications. The EPEEC software can be checked out from the project's GitHub repository.
The EPEEC collaboration supports application developers in distributing their applications across compute nodes and running them in parallel on large clusters. EPEEC provides, on the one hand, a distributed-memory programming scheme and, on the other hand, a shared-memory scheme.
The address space for the distributed-memory scheme is shown in Figure 1. The global address space is partitioned into a local and a remote part using hardware features such as remote direct memory access (RDMA). These allow parallel programs to optimally overlap computation and communication with asynchronous, non-blocking communication models such as GASPI (www.gaspi.de) and its implementation GPI (http://www.gpi-site.com/). Communication is started at the earliest possible stage of the program's execution so that the data arrives in a timely manner, when it is really needed. Moreover, the increasing number of computing components demands more concurrency and the ability to communicate in a more fine-grained manner using tasks. In EPEEC we combine the GASPI/GPI communication with the OmpSs tasking model (https://pm.bsc.es/ompss), which uses the data dependencies between the tasks of a program to parallelise it.
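As a minimal sketch of this pattern (not EPEEC code; the segment id, offsets, sizes, and ring neighbour below are illustrative assumptions), a GPI-2 program can post a one-sided notified write early, compute while the data is in flight, and only wait when the remote data is actually needed:

#include <GASPI.h>

/* Sketch: start communication early, overlap it with local work,
 * then consume the notification when the data is required. */
int main(int argc, char *argv[])
{
    gaspi_proc_init(GASPI_BLOCK);

    gaspi_rank_t rank, nprocs;
    gaspi_proc_rank(&rank);
    gaspi_proc_num(&nprocs);

    const gaspi_segment_id_t seg = 0;
    const gaspi_size_t seg_size = 1 << 20;      /* 1 MiB segment (illustrative) */
    gaspi_segment_create(seg, seg_size, GASPI_GROUP_ALL,
                         GASPI_BLOCK, GASPI_MEM_INITIALIZED);

    gaspi_rank_t right = (rank + 1) % nprocs;   /* ring neighbour */

    /* Post a one-sided notified write at the earliest possible stage:
     * our first half goes into the neighbour's second half, with
     * notification 0 attached so the receiver can detect arrival. */
    gaspi_write_notify(seg, 0, right,
                       seg, seg_size / 2, seg_size / 2,
                       0 /* notification id */, 1 /* value */,
                       0 /* queue */, GASPI_BLOCK);

    /* ... local computation here overlaps with the data transfer ... */

    /* Wait for the neighbour's data only when it is really needed. */
    gaspi_notification_id_t got;
    gaspi_notification_t val;
    gaspi_notify_waituntil(seg, 0, 1, &got, GASPI_BLOCK);
    gaspi_notify_reset(seg, got, &val);

    gaspi_wait(0, GASPI_BLOCK);                 /* flush the queue */
    gaspi_proc_term(GASPI_BLOCK);
    return 0;
}

The key design point is that gaspi_write_notify returns immediately; the notification on the receiver side is the only synchronisation, so everything between posting the write and waiting for the notification is overlap of computation and communication.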
Communication in distributed programs: In parallel programs, intermediate results need to be exchanged between the compute components to ensure the progress of the whole program. In general, the following statement holds: using more cores on the same problem size (i.e. strong scaling) means that the computation time per core declines, while in the best case the communication overhead stays constant. For very large systems, the time spent on communication therefore becomes decisive, and it is necessary to employ sophisticated concepts for communicating data.
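As a rough, textbook-style illustration (the model and symbols below are assumptions, not taken from the article): with a fixed problem size, the runtime on p nodes can be sketched as

% Simplified strong-scaling model (illustrative):
% T_comp: total compute time, T_comm: communication overhead
T(p) \approx \frac{T_{\mathrm{comp}}}{p} + T_{\mathrm{comm}}

The first term shrinks as p grows while the second does not, so at large p communication dominates unless it is hidden behind computation.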
A coherent global address space in a distributed system enables shared-memory programming at a much larger scale than a single multicore or a single SMP. This solution is an alternative to the distributed-memory scheme and is called the shared-memory scheme. It hides the distributed nature of the separate compute nodes from application developers, yielding high programming productivity for developers who prefer not to deal explicitly with inter-node communication. Coherence is guarded by the programming model. In order not to lose scalability, it is critical to make decisions locally and avoid the long latencies imposed by the network. Here we use OmpSs@ArgoDSM (ArgoDSM: https://argodsm.com/), which integrates ArgoDSM's fully coherent distributed shared memory as the communication layer in the OmpSs-2@cluster programming model for task-based programs on distributed memory. OmpSs@ArgoDSM benefits from ArgoDSM's advanced prefetching mechanisms to minimise the amount of long-latency communication.
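A minimal sketch of the task-based style this enables (plain OmpSs-2 task syntax; the array size, block size, and kernel are illustrative assumptions, and with OmpSs@ArgoDSM the arrays would live in the coherent global address space rather than in node-local memory):

#include <stdlib.h>

#define N  (1 << 16)   /* number of elements (illustrative) */
#define BS (1 << 10)   /* block size per task (illustrative) */

/* One task per block; the runtime extracts parallelism from the
 * declared dependencies. "a[i;BS]" is an OmpSs array section of
 * BS elements starting at index i. */
static void scale(double *a, double *b)
{
    for (long i = 0; i < N; i += BS) {
        #pragma oss task in(a[i;BS]) out(b[i;BS])
        for (long j = i; j < i + BS; ++j)
            b[j] = 2.0 * a[j];
    }
    #pragma oss taskwait   /* wait for all blocks before returning */
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    for (long i = 0; i < N; ++i)
        a[i] = (double)i;
    scale(a, b);
    free(a);
    free(b);
    return 0;
}

Because dependencies are declared per block, independent blocks run concurrently, and the runtime, with ArgoDSM underneath, decides where data has to move; a compiler without OmpSs support simply ignores the pragmas and runs the loop serially.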

The performance analysis tool Extrae (https://tools.bsc.es/extrae) and the visualisation tool Paraver (https://tools.bsc.es/paraver) help application developers understand the performance behaviour of their codes and identify bottlenecks. Figure 2 puts the whole picture together.

Author: Valeria Bartsch, Fraunhofer ITWM, Work Package Leader for Distributed Support in the EPEEC project
https://www.itwm.fraunhofer.de/en/departments/hpc/staff/valeria-bartsch.html
Follow me on Twitter: https://twitter.com/valeriabartsch