Speaker
Description
With an increasing workload diversity and hardware complexity in HPC, the boundaries of today's runtimes are pushed to their limits. This evolution needs to be matched by corresponding increases in the capabilities of system management solutions.
Power management is a key element in the upcoming exascale era. First to allow us to stay within the power budget, but also for the applications to make the most of the available power in order to make progress. Therefore, our objective is to balance complex applications requirements while keeping power consumption under budget.
To achieve this goal, the Argo group is working on the Node Resource Manager (NRM) tool, which allows us to centralize node management activities such as resource and power management. The latter is achieved by getting information (monitoring) from various sensors (power, temperature, fan speed, frequency...) and adjusting actuators (CPU p-states, Intel RAPL) according to the application needs. The next step in our power management strategy is to improve NRM monitoring to more easily identify the location (within the topology) and scope (range of devices) that monitoring events are related to.
To evaluate our implementation, we are looking for JLESC members willing to extend this work with more complex applications with dynamic resource balancing problems, on which we first can observe such imbalance, and then address it with a better power management strategy relying on precise identification of the relation between the gathered monitoring events, the devices present, and the inner components of applications. We are aiming to get a better understanding of the behavior of such applications under various scenarios of power management, as well as studying the possibility of characterizing applications' power needs in order to develop an automated resource management policy.