POP3 CoE - Performance Optimisation and Productivity CoE

33rd POP Webinar - ZeroSum: User Space Utility for Monitoring Process, Thread, OS and HW Resources, including GPU Utilization

by Kevin Huck (University of Oregon)

Europe/Berlin
Online

Description

High Performance Computing (HPC) systems are large, heterogeneous, and sophisticated, which makes them difficult to use efficiently. HPC users are allocated a finite amount of compute time, yet they have no portable utility to confirm that they are using that allocation effectively. To address this problem, ZeroSum is a user-space library that is launched within the process space of the HPC application. For each application process, it monitors the application threads, MPI communication, and the hardware resources assigned to them, including CPU cores and/or hardware threads, memory usage, and GPU utilization. Supported systems include Linux-based operating systems, as well as GPUs from NVIDIA (using the NVML library), AMD (using the ROCm-SMI library), and Intel (using the SYCL API).
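To make "the hardware resources assigned to them" concrete, the sketch below shows one way a user-space tool on Linux could find out which CPU cores a process is allowed to run on. It is not ZeroSum's implementation. It only assumes the `Cpus_allowed_list` field that Linux exposes in `/proc/<pid>/status`, and the helper names `parse_cpu_list` and `allowed_cpus` are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a Linux CPU list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        first, _, last = part.partition("-")
        # A single CPU ("8") has no upper bound, so reuse the lower one.
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


def allowed_cpus(status_text):
    """Extract the allowed CPU set from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return parse_cpu_list(line.split(":", 1)[1])
    return []


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # only present on Linux
        print("allowed CPUs:", allowed_cpus(status.read_text()))
```

Comparing this set across the ranks of a job is one simple way to spot processes that were accidentally bound to the same cores.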

Host-side monitoring uses the virtual /proc filesystem and is therefore portable to all Linux systems. When ZeroSum is integrated with the hwloc library, the included Python post-processing scripts can generate visualizations of the utilization data. Automatic deadlock detection is available, and ZeroSum will generate call stacks from all ranks, merge them, and visualize the merged call stacks to help diagnose where expected behavior diverged (similar to STAT/Cray-STAT). Monitoring overhead is less than 0.5%.
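As an illustration of what /proc-based host-side monitoring can look like, the following sketch reads per-thread CPU time from `/proc/<pid>/task/<tid>/stat`. It is a minimal example under stated assumptions, not ZeroSum's code: the function names `parse_task_stat` and `thread_cpu_seconds` are hypothetical, and the field positions follow the Linux proc(5) layout (state is field 3, utime field 14, stime field 15).

```python
import os
from pathlib import Path


def parse_task_stat(stat_line):
    """Return (comm, state, utime_ticks, stime_ticks) from one stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' in the line.
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()  # fields[0] is field 3 (state) of the stat line
    return comm, fields[0], int(fields[11]), int(fields[12])


def thread_cpu_seconds(pid="self"):
    """Map thread id -> (name, state, user seconds, system seconds)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    result = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            stat_line = (task / "stat").read_text()
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited while we were scanning
        comm, state, utime, stime = parse_task_stat(stat_line)
        result[int(task.name)] = (
            comm, state, utime / ticks_per_second, stime / ticks_per_second
        )
    return result


if __name__ == "__main__":
    if Path("/proc/self/task").exists():  # only present on Linux
        for tid, info in sorted(thread_cpu_seconds().items()):
            print(tid, info)
```

Sampling these counters periodically and differencing consecutive samples gives per-thread utilization over time, which is the kind of data a monitoring tool can collect without kernel modules or elevated privileges.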

POP CoE Task Lead Dissemination