Parallel architectures with non-uniform memory access (NUMA) are emerging as the norm in HPC clusters. In a NUMA system, CPUs and memory are organized as a set of interconnected nodes, where each node typically comprises one or more multi-core CPUs as well as one or more memory controllers. The non-uniform memory access nature stems from this organization, since the memory access bandwidth and latency depend on the node where the accessing thread runs and on the node where the target page resides.
When one deploys a parallel application on a NUMA system, its threads allocate and access pages that need to be physically mapped to the available NUMA nodes. This raises a crucial question: where should each page be mapped for optimal performance?
Table: Node-to-node bandwidths (GB/s) on an 8-node AMD Opteron machine
When the application is memory-intensive, a common strategy is to uniformly interleave its pages across the set of worker nodes, i.e., the nodes on which the application threads run. The rationale is that, for a large class of memory-intensive applications, bandwidth, rather than access latency, is the main bottleneck [1]. Interleaving pages across nodes therefore provides threads with a higher aggregate memory bandwidth. However, a preliminary study conducted by INESC-ID's team at EPEEC showed that the memory bandwidth attained by the uniform page interleaving policy (as directly supported in Linux through numactl/libnuma) can be considerably suboptimal for memory-intensive applications [2].
To overcome these inefficiencies, INESC-ID has designed and implemented BWAP, a novel bandwidth-aware page placement tool for memory-intensive parallel applications running on NUMA-based systems. In contrast to the uniform interleaving policy offered by Linux, BWAP takes the asymmetric bandwidths of every node into account to determine and enforce an optimized application-specific weighted interleaving.
A paper detailing BWAP’s design, implementation and evaluation was published at IPDPS 2020.
The source code of BWAP is available here: https://github.com/epeec/BWAP
BWAP in a nutshell
BWAP adopts a novel approach that combines two main techniques. In the first stage, BWAP builds a memory bandwidth model of the target system. From this model, BWAP calculates the optimal weight distribution that maximizes the performance of a reference bandwidth-intensive application. The key insight behind BWAP is that, after analytically determining this canonical weight distribution, it can be adjusted to fit the target application by applying a scalar coefficient to each weight. In other words, BWAP reduces what is in theory an N-dimensional optimization problem (where N is the number of NUMA memory nodes) to the one-dimensional problem of finding the scaling coefficient that best fits the application. To find this coefficient, the second stage of BWAP relies on an iterative technique: when the application starts, its pages are placed according to the canonical weight distribution; then, on the fly, an incremental page migration scheme adjusts the weight distribution until a new (local) optimum is found.
Figure: Overview of BWAP's two-stage approach to page placement
BWAP is implemented as an extension to Linux libnuma. It enriches the original interface with a bandwidth-interleaved policy option that automatically determines the memory nodes on which to place the application's pages, as well as the per-node weights that balance the page interleaving across those nodes. BWAP is fully implemented and tested, and is readily available as open source. It can be used transparently by any application, with no changes to the Linux kernel.
An exhaustive experimental evaluation with parallel shared-memory applications showed that BWAP achieves up to 66% performance improvement over state-of-the-art page placement strategies. The highest improvements are obtained with parallel applications that exhibit high memory demand and run on only a subset of the cores/CPUs of a NUMA system (e.g., due to scalability constraints or co-location with other processes on the same machine).
How to use BWAP
Upon installation on a given machine
Before using BWAP to optimize the page placement of applications on a given machine, an initial profiling tool, called the canonical tuner, needs to be executed in order to derive a model of the underlying memory architecture. The canonical tuner runs a profiling procedure for a set of relevant combinations of worker node sets (of different sizes). The set of explored worker node sets does not need to be exhaustive: i) many worker node sets can be filtered out because they are unlikely to be chosen by a rational user (e.g., in a dual-socket machine with 2+2 nodes, a 2-node worker set comprising one node from each socket, since those nodes are interconnected by a low-bandwidth link); ii) many worker node sets are symmetric, hence only one of them needs to be profiled (e.g., in a dual-socket machine with 2+2 nodes and symmetric inter-socket links, the optimal weight distribution for the worker set comprising the two nodes of one socket is symmetric to that of the set comprising the two nodes of the other socket).
Preparing an application to use BWAP
The DWP tuner takes action when an application is launched. The tuner's API includes a main function, BWAP-init, which should be called by the target application once it has allocated its initial shared structures. Note that the DWP tuner targets applications that, after an initial stage, enter an execution stage with stable memory access behaviour. The main argument of BWAP-init is the set of worker nodes on which the application runs. No further changes to the application are needed.
[1] B. Lepers, V. Quéma, and A. Fedorova, "Thread and Memory Placement on NUMA Systems: Asymmetry Matters", in Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC '15), USENIX Association, pp. 277–289, 2015.
[2] D. Gureya, J. Neto, R. Karimi, J. Barreto, P. Bhatotia, V. Quéma, R. Rodrigues, P. Romano, and V. Vlassov, "Bandwidth-Aware Page Placement in NUMA Systems", in Proceedings of the 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2020.