Publications

Publications by Diogo Luzio Leitão

2021

MONARCH: Hierarchical Storage Management for Deep Learning Frameworks

Authors
Dantas, M; Leitao, D; Correia, C; Macedo, R; Xu, WJ; Paulo, J

Publication
2021 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2021)

Abstract
Due to convenience and usability, many deep learning (DL) jobs resort to the available shared parallel file system (PFS) for storing and accessing training data when running in HPC environments. Under such a scenario, however, where multiple I/O-intensive applications operate concurrently, the PFS can quickly get saturated with simultaneous storage requests and become a critical performance bottleneck, leading to throughput variability and performance loss. We present MONARCH, a framework-agnostic middleware for hierarchical storage management. This solution leverages the existing storage tiers present at modern supercomputers (e.g., compute node's local storage, PFS) to improve DL training performance and alleviate the current I/O pressure of the shared PFS. We validate the applicability of our approach by developing and integrating an early prototype with the TensorFlow DL framework. Results show that MONARCH can reduce I/O operations submitted to the shared PFS by up to 45%, decreasing training time by 24% and 12%, for I/O-intensive models, namely LeNet and AlexNet.
