As high-performance systems evolve towards exascale and beyond, software development becomes increasingly complex due to the rapid growth in the number of cores available in a single system. Many alternative programming models have recently been proposed, aiming to ease software development while increasing application efficiency. The main objective of this thesis is to evaluate the limitations and advantages of one of these new programming models (OmpSs-2) when applied to a real, complex application. Our starting point is a sequential 2D plasma simulator (ZPIC) that has the same core algorithm and functionalities as OSIRIS. In the first part of this thesis, we follow a spatial decomposition to parallelize ZPIC and target multicore CPUs. After applying dynamic load balancing, our implementation not only achieves near-perfect scaling in one node of MareNostrum4 but also shows very consistent performance across all simulations. In the second part of this thesis, we target GPUs using a combination of OmpSs-2 and OpenACC. To use the device architecture efficiently, we introduce major changes to ZPIC’s algorithm, including sorting the particles by tiles, using shared memory as a cache, and restructuring the particles’ data for coalesced memory accesses. The final implementation running on a single NVIDIA V100 GPU achieves up to 20x the performance of two IBM Power9 processors, demonstrates excellent scaling on 2 GPUs, and shows potential to scale up to 4 accelerators.
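The tile sorting and data restructuring mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from ZPIC: the function name `sort_particles_by_tile`, the separate `x`/`y` coordinate lists (a structure-of-arrays layout), and the row-major tile numbering are all assumptions made for illustration, and particles are assumed to lie inside the domain.

```python
def sort_particles_by_tile(x, y, tile_size, tiles_x, tiles_y):
    """Counting sort of particles by the tile that contains them.

    Hypothetical helper, not ZPIC code. Positions are kept as separate
    x and y lists (structure of arrays), so particles of the same tile
    end up contiguous in each list.
    """
    n_tiles = tiles_x * tiles_y

    # Tile id of every particle, using row-major tile numbering.
    tile_of = [
        int(py // tile_size) * tiles_x + int(px // tile_size)
        for px, py in zip(x, y)
    ]

    # Histogram: number of particles per tile.
    counts = [0] * n_tiles
    for t in tile_of:
        counts[t] += 1

    # Exclusive prefix sum: offsets[t] is where tile t starts.
    offsets = [0] * (n_tiles + 1)
    for t in range(n_tiles):
        offsets[t + 1] = offsets[t] + counts[t]

    # Scatter particle indices into their tile's range (stable order).
    cursor = list(offsets[:-1])
    order = [0] * len(x)
    for i, t in enumerate(tile_of):
        order[cursor[t]] = i
        cursor[t] += 1

    sorted_x = [x[i] for i in order]
    sorted_y = [y[i] for i in order]
    return sorted_x, sorted_y, offsets
```

After the sort, the particles of tile `t` occupy the index range `offsets[t]` to `offsets[t + 1]` in both lists, which is the property that lets neighbouring threads read neighbouring elements.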
K. Matsumura, S. G. de Gonzalo and A. J. Peña, "JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization," 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2021, pp. 182-191, doi: 10.1109/HiPC53243.2021.00032.
Guidotti, N. et al. (2021). Particle-In-Cell Simulation Using Asynchronous Tasking. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_30
Alexandro Baldassin, João Barreto, Daniel Castro, and Paolo Romano. 2021. Persistent Memory: A Survey of Programming Support and Implementations. ACM Comput. Surv. 1, 1, Article 1 (January 2021), 37 pages. https://doi.org/10.1145/3465402
Cloud computing has become ubiquitous due to its resource flexibility and cost efficiency. Resource flexibility allows Cloud users to elastically scale their Cloud resources, for instance, by horizontally scaling the number of virtual machines allocated to each application as the application demands change. However, matching resources to application demands is non-trivial, and applications with highly dynamic workloads make it even more difficult. Cost efficiency is primarily achieved through workload consolidation, i.e., by co-locating applications on the same physical host. Unfortunately, workload consolidation often comes at a performance penalty, as consolidated applications contend for shared resources, leading to interference and performance unpredictability. Interference is particularly destructive for latency-critical applications, which must meet strict quality of service (QoS) requirements. Another significant technological trend is the growing prevalence of multi-socket systems in contemporary data centers. However, to the best of our knowledge, existing proposals for QoS-aware resource allocation are, by design, not tailored to multi-socket systems. Specifically, existing proposals do not support cross-socket sharing of memory, which entails a sub-optimal use of a multi-socket host’s aggregate memory resources. This thesis focuses on two aspects of Cloud resource management, namely QoS-aware elasticity and resource arbitration, at two levels: inter-node resource management and intra-node resource management. At the first level, we consider the number of virtual machines (VMs) as the main resource to allocate and de-allocate for horizontal auto-scaling of an elastic service or application in the Cloud. At the second level, intra-node resource management, we treat the memory bandwidth in multi-socket systems as the resource to arbitrate among co-located applications.
At both levels, the overall goal of this thesis is to provide resource management mechanisms that automatically adapt the resources allocated to data-intensive services to improve resource utilization while meeting service-level objectives (SLOs). In the context of inter-node resource management for auto-scaling of elastic Cloud services, this thesis improves the usefulness of elasticity controllers by addressing some of the challenges posed by current model-predictive control systems (such as training and tuning of the controller and adapting it to different workload patterns). To enable elastic execution of Cloud-based services using model-predictive control, we propose, implement, and evaluate OnlineElastMan, a self-trained proactive elasticity manager for Cloud-based storage services. OnlineElastMan stands out from its peers in its practical aspects, including easily measurable and obtainable performance and QoS metrics, automatic online training, and an embedded generic workload prediction module. Our evaluation shows that OnlineElastMan continuously improves its provisioning accuracy, minimizing provisioning cost and SLO violations, under various workload patterns. In the context of intra-node resource management, this thesis starts from the observation that, since state-of-the-art QoS-aware resource allocation systems disallow cross-socket sharing of memory among consolidated applications, the memory bandwidth resources of multi-socket hosts cannot be properly exploited. Therefore, this thesis aims to fill that gap by designing, implementing, and evaluating two novel techniques for memory bandwidth allocation for multi-socket Cloud nodes. First, we propose BWAP, a novel bandwidth-aware page placement tool for memory-intensive applications on non-uniform memory access (NUMA) systems. BWAP takes the asymmetric bandwidths of every NUMA node into account to determine and enforce an optimized application-specific weighted interleaving.
Our evaluations on a diverse set of memory-intensive workloads show that BWAP achieves up to 4× speedups when compared to a first-touch baseline policy (as provided by Linux’s default). Second, we propose BALM, a QoS-aware memory bandwidth allocation technique for multi-socket architectures. The key insight of BALM is to combine commodity bandwidth allocation mechanisms originally designed for single-socket systems with a novel adaptive cross-socket page migration scheme. Our evaluation shows that BALM can safeguard the SLO of latency-critical applications, with marginal SLO violation windows, while delivering up to 87% throughput gains to bandwidth-intensive best-effort applications compared to state-of-the-art alternatives. All solutions proposed and presented in this thesis, namely OnlineElastMan, BWAP and BALM, have been implemented and evaluated on real-world workloads. The results indicate the feasibility and effectiveness of our proposed approaches to improve inter-node and intra-node resource management through QoS-aware elastic execution and effective arbitration of resources among consolidated workloads in Cloud nodes.
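To give a rough feel for what a weighted interleaving means, the sketch below distributes pages across NUMA nodes in proportion to each node's bandwidth. It is not BWAP's algorithm, which the abstract describes as application-specific and optimized. The proportional weights, the function names `interleave_weights` and `place_pages`, and the bandwidth figures in the usage note are all assumptions for illustration.

```python
def interleave_weights(bandwidths):
    """Weights proportional to each node's bandwidth (a simplifying assumption)."""
    total = sum(bandwidths.values())
    return {node: bw / total for node, bw in bandwidths.items()}


def place_pages(num_pages, weights):
    """Assign pages to nodes so that per-node counts track the weights.

    Hypothetical helper: each page goes to the node that is currently
    furthest below its target share (a weighted round-robin).
    """
    placed = {node: 0 for node in weights}
    placement = []
    for page in range(num_pages):
        # Deficit = pages the node should hold by now minus pages it holds.
        node = max(weights, key=lambda n: weights[n] * (page + 1) - placed[n])
        placed[node] += 1
        placement.append(node)
    return placement
```

For example, with made-up bandwidths `{0: 30.0, 1: 10.0}`, `place_pages(8, interleave_weights({0: 30.0, 1: 10.0}))` puts six pages on node 0 and two on node 1, whereas a uniform interleaving would put four on each.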
Daniel Castro, Alexandro Baldassin, João Barreto and Paolo Romano. SPHT: Scalable Persistent Hardware Transactions. 19th USENIX Conference on File and Storage Technologies (FAST'21).
Bosch, J. et al. (2021). Task-Based Programming Models for Heterogeneous Recurrent Workloads. In: Derrien, S., Hannig, F., Diniz, P.C., Chillet, D. (eds) Applied Reconfigurable Computing. Architectures, Tools, and Applications. ARC 2021. Lecture Notes in Computer Science, vol 12700. Springer, Cham. https://doi.org/10.1007/978-3-030-79025-7_8
Pérez Arroyo C., Dombard J., Duchaine F., Gicquel L., Martin B., Odier N., and Staffelbach G. (2021). Towards the Large-Eddy Simulation of a full engine: Integration of a 360 azimuthal degrees fan, compressor and combustion chamber. Part I: Methodology and initialisation. Journal of the Global Power and Propulsion Society. Special Issue: Data-Driven Modelling and High-Fidelity Simulations: 1–16. https://doi.org/10.33737/jgpps/133115
P. Ekemark, Y. Yao, A. Ros, K. Sagonas and S. Kaxiras, "TSOPER: Efficient Coherence-Based Strict Persistency," 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 125-138, doi: 10.1109/HPCA51647.2021.00021.
Antonio J. Peña. A Software Ecosystem to Save Money in DRAM and Increase Performance with Optane DIMMs. Intel HPC+AI Pavilion. 2020.