Speaker
Description
This is the report for the project 'Optimization of Fault-Tolerance Strategies for Workflow Applications'.
Checkpoint operations are periodic, high-volume I/O operations and, as such, are particularly sensitive to interference. Indeed, HPC applications execute on dedicated nodes but share the I/O system. As a consequence, interference surges when several applications perform I/O operations simultaneously: each I/O operation takes much longer than expected because each application is allotted only a fraction of the I/O bandwidth.
This is the motivation for our study of I/O interference. We design and evaluate several new algorithms for bandwidth sharing, which we compare with existing work. We assume neither any knowledge of the applications nor any regular pattern in their I/O operations.
Overall, this project talk is NOT about resilience, even though concurrent checkpoints were the initial motivation.
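
To make the slowdown described above concrete, here is a minimal sketch of how checkpoint time grows when several applications write at once. It assumes the simplest possible policy, an equal split of the aggregate I/O bandwidth among concurrent applications. That policy, the function name `checkpoint_time`, and the numbers used are illustrative assumptions, not the algorithms evaluated in the talk.

```python
def checkpoint_time(volume_gb, bandwidth_gb_per_s, concurrent_apps=1):
    """Seconds needed to write `volume_gb` gigabytes.

    Assumption (illustrative only): the aggregate bandwidth is split
    equally among `concurrent_apps` applications doing I/O at once.
    """
    if concurrent_apps < 1:
        raise ValueError("concurrent_apps must be at least 1")
    # Each application only gets a fraction of the shared bandwidth.
    share = bandwidth_gb_per_s / concurrent_apps
    return volume_gb / share


if __name__ == "__main__":
    # Hypothetical figures: a 100 GB checkpoint on a 10 GB/s I/O system.
    print(checkpoint_time(100, 10))      # alone: 10.0 seconds
    print(checkpoint_time(100, 10, 4))   # with 3 other writers: 40.0 seconds
```

Under this equal-split assumption, the duration of one checkpoint scales linearly with the number of applications writing simultaneously, which is the effect that bandwidth-sharing algorithms aim to mitigate.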
| JLESC topic | I/O |
|---|---|