Description
Chairperson: Terry Cojean (UTK)
In this paper, we propose an approach based on signal processing to characterize HPC applications’ temporal I/O behavior. In the context of each application, our goal is to detect/predict the temporal aspects of its access pattern, i.e. the I/O phases (each composed of one or many individual I/O requests) and their periodicity. Such information can be very useful for optimization techniques such...
The growing complexity arising in the development of HPC libraries and applications impedes speedy code development. To rein in this complexity, CI tools and workflows are a great way to automate large portions of test-driven development cycles.
In this short talk, we want to present the current impact of our CI-HPC tools for automating such workflows. Our FMM library FMsolvr will be used as a...
Many HPC applications display iterative patterns, where a series of computations and communications are repeated a specific number of times. This pattern happens, for example, in multi-step simulations, iterative mathematical methods and machine learning training. When these applications are coded using data-flow programming models, much time is spent creating tasks and processing dependencies...
Checkpointing is the most widely used approach to provide resilience for HPC applications by enabling restart in case of failures. However, coupled with a searchable lineage that records the evolution of intermediate data and metadata during runtime, it can become a powerful technique in a wide range of scenarios at scale: verify and understand the results more thoroughly by sharing and...
One piece of information that HPC batch schedulers use to schedule jobs on the available resources is the user runtime estimate: an estimate provided by the user of how long their job will run on the machine. These estimates are known to be inaccurate, hence many works have focused on improving runtime prediction.
In this work, we start by discussing bias and limitations of the most...
Elasticity, or the ability to adapt a system to a dynamically changing workload, has been a core feature of Cloud Computing storage since its inception more than two decades ago. In the meantime, HPC applications have mostly continued to rely on static parallel file systems to store their data. This picture is now changing as more and more applications adopt custom data services tailored to...