Given the expected power consumption of future exascale computing systems, energy efficiency will be a main concern for application developers. In the context of EPEEC, we plan to extend the nanos6 runtime system so that it can decide dynamically which implementation of a spawned task (among multiple alternatives) is the most efficient to run at a given point in the computation, in terms of execution time, power consumption, or a combination of both. For this, different implementations of the same task, possibly targeting multiple computational devices, must be available and must be characterized.
We have developed a profiling tool that is able to collect energy statistics of individual tasks from running code and produce structured information that the nanos6 runtime can read and use to select the most appropriate task implementation to run during program execution.
Profiling Tool
The current version of the tool supports profiling x86_64 CPUs and Nvidia GPUs. On the CPU side, Intel RAPL is used through Linux’s power capping interface. This interface provides access to the accumulated energy consumption of the CPU package, covering both its cores and other integrated components, as well as the socket’s memory. On the GPU side, the official Nvidia Management Interface is used, which reports the energy consumption of the whole graphics card, including memory. Support for AMD GPUs and IBM POWER CPUs is being investigated.
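As an illustration of the CPU side, the following is a minimal sketch of reading the accumulated energy counter that the Linux power capping (powercap) interface exposes through sysfs. It is not the profiling tool's code. The zone path intel-rapl:0 is an assumption (zone numbering differs between machines), the function names are hypothetical, and reading the counter may require elevated privileges on recent kernels. Because the counter is cumulative and wraps around, the consumed energy is the difference between two readings, corrected with the zone's max_energy_range_uj value.

```python
from pathlib import Path

# Assumed location of the package-0 RAPL zone; the zone layout varies per machine.
RAPL_ZONE = Path("/sys/class/powercap/intel-rapl:0")


def read_counter_uj(zone: Path = RAPL_ZONE) -> int:
    """Return the accumulated energy counter of a powercap zone, in microjoules."""
    return int((zone / "energy_uj").read_text())


def read_max_range_uj(zone: Path = RAPL_ZONE) -> int:
    """Return the value at which the zone's energy counter wraps around."""
    return int((zone / "max_energy_range_uj").read_text())


def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Energy consumed between two readings, tolerating a single wraparound."""
    if after >= before:
        return after - before
    # The counter wrapped between the two readings.
    return after + max_range - before
```

A measurement would take one reading before the task, one after, and pass both to energy_delta_uj together with the zone's maximum range.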
In addition to measuring the total energy consumption of a user-specified task in the code, the tool also reports the energy spent on the computation itself, i.e., the total energy consumption minus the energy spent idling. The tool obtains the idle figure from an initial calibration phase during which the devices do nothing.
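The idle correction can be sketched as follows. This is an assumption about how the subtraction might be carried out, not the tool's actual code: the idle calibration is converted into an average idle power, which is then scaled by the task's duration before being subtracted. All function and parameter names are hypothetical.

```python
def idle_power_watts(idle_energy_j: float, idle_duration_s: float) -> float:
    """Average power drawn while the device does nothing (calibration phase)."""
    if idle_duration_s <= 0:
        raise ValueError("idle_duration_s must be positive")
    return idle_energy_j / idle_duration_s


def net_task_energy_j(total_energy_j: float, task_duration_s: float,
                      idle_power_w: float) -> float:
    """Energy attributed to the computation: total minus the idle share."""
    net = total_energy_j - idle_power_w * task_duration_s
    # Measurement noise can push the difference slightly below zero; clamp it.
    return max(net, 0.0)
```

For example, a device that consumed 50 J over a 5 s idle calibration draws 10 W at rest, so a task that took 2 s and consumed 30 J in total would be credited with 10 J of computation energy.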
One problem with the CPU interface is that it only exposes the energy consumption of the whole socket, which means that sampling during a concurrent workload may produce inaccurate results. The current implementation simulates a single-threaded execution by stopping all threads other than the one being measured. As a result, it is not yet possible to measure workloads that require inter-thread communication. However, this limitation is not an issue for a task-based model whose synchronization mechanisms rely on predefined data dependencies.
At the time of writing, the tool has difficulties with asynchronous workloads such as GPU kernels, although CUDA, for example, provides a way to instruct its runtime to launch kernels in blocking mode. Asynchronous support on the CPU side is being investigated.
Integration with the Runtime System
Once the profiling tool has been run on a short sample execution of an application, the resulting energy consumption data, stored as JSON files, can be used by the nanos6 runtime. During initialization, the runtime detects whether such a file has been provided, parses it, and creates an internal directory of the measurements. Since this data is provided on a per-task basis, the runtime uses each task's label to look it up. The label is an optional clause of OmpSs-2 task directives, but in this context it must be added to the application code, because it is needed to meaningfully associate profiler data with actual tasks.
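To make the idea of a label-indexed directory concrete, here is a small sketch of loading profiler output and grouping it by task label. The JSON layout (a "tasks" array whose entries carry "label", "device", "energy_j" and "time_s") is a hypothetical schema invented for this example, since the actual file format is not described here. The names Measurement and build_directory are likewise illustrative.

```python
import json
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical profiler output; the real schema may differ.
SAMPLE_PROFILE = """
{"tasks": [
  {"label": "matmul", "device": "cpu", "energy_j": 12.5, "time_s": 0.80},
  {"label": "matmul", "device": "gpu", "energy_j": 20.0, "time_s": 0.10},
  {"label": "reduce", "device": "cpu", "energy_j": 1.5,  "time_s": 0.05}
]}
"""


@dataclass(frozen=True)
class Measurement:
    device: str
    energy_j: float
    time_s: float


def build_directory(json_text: str) -> dict[str, list[Measurement]]:
    """Group the measurements of every task implementation under its label."""
    directory = defaultdict(list)
    for entry in json.loads(json_text)["tasks"]:
        directory[entry["label"]].append(
            Measurement(entry["device"], entry["energy_j"], entry["time_s"])
        )
    return dict(directory)
```

With this structure, looking up a ready task reduces to a single dictionary access on its label, and a task without a label (or without profiling data) simply has no entry.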
During program execution, when a task with multiple implementations becomes ready (that is, waiting to be scheduled as soon as resources are available), the runtime will trigger the Energy Manager, a new component currently under development as part of this work. The Energy Manager checks whether the task’s label corresponds to an entry in the energy profiling directory. This is the point where the task implementation has to be chosen, based on this data and/or other factors of the runtime’s current state:
- Prioritize least energy consumption?
- Prioritize best execution time?
- Favour load-balancing or waiting time (e.g., when a particular device type is busy)?
- Combinations of the above or even more advanced heuristics?
When the choice is made, the selected task implementation will be passed to the corresponding runtime scheduler (host thread or device) to be executed accordingly. This clear separation of the Energy Manager from the Scheduler and executing facilities of the runtime system allows orthogonal experimentation with different approaches in task implementation selection, without interfering with other runtime subsystems.
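One possible selection policy is sketched below, purely as an illustration of how the criteria listed above could be combined. It is not the Energy Manager's implementation, which is still under development. The names Implementation and select_implementation, the energy_weight parameter, and the normalization scheme are all assumptions made for this example. Setting energy_weight to 1.0 prioritizes the lowest energy consumption, 0.0 prioritizes the best execution time, and intermediate values trade one off against the other. Implementations on busy devices are skipped unless every device is busy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    device: str
    energy_j: float
    time_s: float


# Hypothetical profiled alternatives of one task.
CANDIDATES = [
    Implementation("cpu", energy_j=10.0, time_s=4.0),
    Implementation("gpu", energy_j=25.0, time_s=1.0),
]


def select_implementation(candidates, energy_weight=0.5,
                          busy_devices=frozenset()):
    """Pick the candidate with the lowest weighted, normalized cost."""
    if not candidates:
        raise ValueError("no candidate implementations")
    # Prefer idle devices; fall back to all candidates if every device is busy.
    available = [c for c in candidates if c.device not in busy_devices]
    available = available or list(candidates)
    # Normalize by the maximum so that energy and time are comparable.
    max_energy = max(c.energy_j for c in available) or 1.0
    max_time = max(c.time_s for c in available) or 1.0

    def cost(c):
        return (energy_weight * c.energy_j / max_energy
                + (1.0 - energy_weight) * c.time_s / max_time)

    return min(available, key=cost)
```

Because the policy is a single function over profiled data and a view of the device state, it can be replaced by another heuristic without touching the scheduler, which is the kind of separation described above.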