21–23 Mar 2023
LaBRI
Europe/Paris timezone

Cloud-Bursting and Autoscaling for Python-Native Scientific and AI Workflows

22 Mar 2023, 15:00
10m
Salle Ada Lovelace (INRIA)

Short talk; Programming languages and runtimes; Short Talks on Distributed Resources

Speaker

Mr Tingkai Liu (University of Illinois at Urbana-Champaign)

Description

We have extended the Ray framework to automatically scale workloads on high-performance computing (HPC) clusters managed by SLURM© and to burst to a Cloud managed by Kubernetes®. Our implementation allows a single Python-based parallel workload to run concurrently across an HPC cluster and a Cloud. The Python-level abstraction offers a transparent user experience and requires only minimal adoption of the Ray framework. We use applications in Electronic Design Automation and Machine Learning to demonstrate how the solution scales a workload on an on-premises HPC system and automatically bursts to a public Cloud when the allocated HPC resources run out. The paper describes the initial implementation, demonstrates the novel functionality of the proposed framework using three applications, and identifies practical considerations and limitations of the Cloud bursting mode.
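The bursting behaviour described above can be illustrated with a small, framework-free sketch. It does not use Ray, SLURM, or Kubernetes and is not the authors' implementation; `Pool`, `place_tasks`, and the slot capacities are hypothetical names and values chosen only to show the "fill the HPC allocation first, then burst to the Cloud" idea.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical stand-in for a set of worker slots (an HPC allocation or a cloud)."""
    name: str
    capacity: int                               # maximum number of concurrent tasks
    running: list = field(default_factory=list)

    def has_free_slot(self) -> bool:
        return len(self.running) < self.capacity


def place_tasks(tasks, hpc: Pool, cloud: Pool):
    """Assign each task to the HPC pool first; burst to the cloud once HPC is full.

    Returns a dict mapping each task to a pool name, or to None when both
    pools are full and the task has to stay pending.
    """
    placement = {}
    for task in tasks:
        if hpc.has_free_slot():
            target = hpc
        elif cloud.has_free_slot():
            target = cloud
        else:
            # Assumption: a real autoscaler would request more nodes here.
            placement[task] = None
            continue
        target.running.append(task)
        placement[task] = target.name
    return placement


# Illustrative capacities only: 2 HPC slots, 3 cloud slots, 6 tasks.
hpc = Pool("hpc", capacity=2)
cloud = Pool("cloud", capacity=3)
placement = place_tasks([f"task-{i}" for i in range(6)], hpc, cloud)
```

In this toy run the first two tasks land on the HPC pool, the next three burst to the cloud pool, and the last one stays pending. The sketch ignores task completion, data movement, and node start-up latency, which are the kinds of practical considerations the abstract refers to.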

JLESC topic HPC+Cloud

Primary authors

Mr Tingkai Liu (University of Illinois at Urbana-Champaign)
Dr Marquita Ellis (IBM)
Dr Carlos Costa (IBM)
Dr Claudia Misale (IBM)
Volodymyr Kindratenko (University of Illinois at Urbana-Champaign)
Mrs Sara Kokkila-Schumacher (IBM)
