21–23 Mar 2023
LaBRI
Europe/Paris timezone

Understanding the relation between monitoring events and topology of exascale architectures for HPC applications

23 Mar 2023, 11:30
10m
LaBRI Amphi (LaBRI)


Short talk Performance tools Short Talks on Interactive Tools and Monitoring

Speaker

Idriss Daoudi (Argonne National Laboratory)

Description

With increasing workload diversity and hardware complexity in HPC, today's runtimes are being pushed to their limits. This evolution must be matched by a corresponding increase in the capabilities of system management solutions.

Power management is a key element of the upcoming exascale era: first, to stay within the power budget, and second, to let applications make the most of the available power in order to make progress. Our objective is therefore to balance the requirements of complex applications while keeping power consumption under budget.

To achieve this goal, the Argo group is developing the Node Resource Manager (NRM), a tool that centralizes node management activities such as resource and power management. Power management is achieved by monitoring various sensors (power, temperature, fan speed, frequency, etc.) and adjusting actuators (CPU P-states, Intel RAPL) according to the application's needs. The next step in our power management strategy is to improve NRM monitoring so that we can more easily identify the location (within the topology) and the scope (range of devices) that each monitoring event relates to.
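
As a rough illustration of what attaching a location and a scope to monitoring events could look like, here is a minimal Python sketch. It is not NRM's API: the `Scope` and `Event` classes, the `events_by_device` and `over_budget_scopes` helpers, and the device names are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Scope:
    """Hypothetical scope: the set of topology devices an event refers to."""
    name: str
    devices: FrozenSet[str]


@dataclass(frozen=True)
class Event:
    """Hypothetical monitoring event: one sensor reading tagged with its scope."""
    sensor: str   # e.g. "power", "temperature"
    value: float
    scope: Scope


def events_by_device(events: List[Event]) -> Dict[str, List[Event]]:
    """Index events by every device covered by their scope."""
    index: Dict[str, List[Event]] = {}
    for ev in events:
        for dev in sorted(ev.scope.devices):
            index.setdefault(dev, []).append(ev)
    return index


def over_budget_scopes(events: List[Event], budget_watts: float) -> List[str]:
    """Return the names of scopes whose latest power reading exceeds the budget."""
    latest: Dict[str, float] = {}
    for ev in events:
        if ev.sensor == "power":
            latest[ev.scope.name] = ev.value  # later readings overwrite earlier ones
    return sorted(name for name, watts in latest.items() if watts > budget_watts)


# Illustrative data only: two CPU packages, each covering two cores.
pkg0 = Scope("package0", frozenset({"core0", "core1"}))
pkg1 = Scope("package1", frozenset({"core2", "core3"}))
events = [
    Event("power", 95.0, pkg0),
    Event("power", 120.0, pkg1),
    Event("temperature", 70.0, pkg0),
]
```

With events tagged this way, a control policy could look up which readings concern a given device, or which scopes currently exceed a power budget, before deciding which actuator to adjust.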

To evaluate our implementation, we are looking for JLESC members willing to extend this work to more complex applications with dynamic resource balancing problems. On such applications, we can first observe the imbalance and then address it with a better power management strategy, one that relies on precisely identifying the relation between the gathered monitoring events, the devices present, and the inner components of the applications. We aim to better understand the behavior of such applications under various power management scenarios, and to study whether applications' power needs can be characterized in order to develop an automated resource management policy.

Primary author

Idriss Daoudi (Argonne National Laboratory)
