21–23 Mar 2023
LaBRI
Europe/Paris timezone

Controlling the Energy Efficiency of HPC Nodes - A Reinforcement Learning Based Approach

22 Mar 2023, 14:20
10m
LaBRI Amphi (LaBRI)

Short talk Advanced architectures Short Talks on Advanced Architectures

Speaker

Akhilesh Raj (Student Researcher at Argonne National Lab)

Description

Exascale systems draw a significant amount of power. As each deployed application maps to the various heterogeneous computing elements of these platforms, managing how power is distributed across components becomes a priority. The ECP Argo project is developing an infrastructure for node-local control loops that can observe application behavior and dynamically adjust resources, power included, for better performance. We have recently developed a control loop using reinforcement learning, with a proximal policy optimization algorithm, trained on an existing mathematical model of how application progress responds to power capping. This dependency on the mathematical model is a hindrance: progress (instantaneous performance) is stochastic (noisy) under a dynamic workload, so a good approximation model demands more data and lengthy characterization studies. We are therefore exploring methods for bypassing this mathematical model, such as actor-critic methods, and are looking for collaborators with know-how on other options, for example real-time training, existing fully characterized applications, or alternative control loop designs.

Primary authors

Akhilesh Raj (Student Researcher at Argonne National Lab)
Swann Perarnau (Argonne National Laboratory)
