
Publications by Ricardo Gonçalves Macedo

2021

The Case for Storage Optimization Decoupling in Deep Learning Frameworks

Authors
Macedo, R; Correia, C; Dantas, M; Brito, C; Xu, WJ; Tanimura, Y; Haga, J; Paulo, J;

Publication
2021 IEEE International Conference on Cluster Computing (CLUSTER 2021)

Abstract
Deep Learning (DL) training requires efficient access to large collections of data, leading DL frameworks to implement individual I/O optimizations to take full advantage of storage performance. However, these optimizations are intrinsic to each framework, limiting their applicability and portability across DL solutions, while making them inefficient for scenarios where multiple applications compete for shared storage resources. We argue that storage optimizations should be decoupled from DL frameworks and moved to a dedicated storage layer. To achieve this, we propose a new Software-Defined Storage architecture for accelerating DL training performance. The data plane implements self-contained, generally applicable I/O optimizations, while the control plane dynamically adapts them to cope with workload variations and multi-tenant environments. We validate the applicability and portability of our approach by developing and integrating an early prototype with the TensorFlow and PyTorch frameworks. Results show that our I/O optimizations significantly reduce DL training time by up to 54% and 63% for TensorFlow and PyTorch baseline configurations, while providing similar performance benefits to framework-intrinsic I/O mechanisms provided by TensorFlow.
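The proposed split is concrete enough to sketch. Below is a minimal, hypothetical illustration in C++ (all class names and knobs are ours, not the paper's API): a data-plane stage implements a generic, self-contained read-ahead optimization, and a control plane retunes it at runtime as tenancy changes, without touching TensorFlow or PyTorch code.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <vector>

// Data-plane stage: a self-contained, framework-agnostic read-ahead
// optimization with a single tunable knob (the read-ahead window).
class ReadAheadStage {
 public:
  // Read `size` bytes at `offset`, speculatively fetching extra bytes.
  std::vector<char> read(std::FILE* f, long offset, size_t size) {
    std::vector<char> buf(size + window_.load());
    std::fseek(f, offset, SEEK_SET);
    buf.resize(std::fread(buf.data(), 1, buf.size(), f));
    return buf;
  }
  void set_window(size_t bytes) { window_.store(bytes); }

 private:
  std::atomic<size_t> window_{1 << 20};  // default: 1 MiB read-ahead
};

// Control plane: retunes the stage as the workload changes, e.g., shrinking
// the read-ahead window when more tenants compete for shared storage.
void control_tick(ReadAheadStage& stage, int active_tenants) {
  stage.set_window((4u << 20) / std::max(1, active_tenants));
}
```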

2021

Pods-as-Volumes: Effortlessly Integrating Storage Systems and Middleware into Kubernetes

Authors
Faria, A; Macedo, R; Paulo, J;

Publication
WOC '21: Proceedings of the Seventh International Workshop on Container Technologies and Container Clouds, Virtual Event, Canada, 6 December 2021

Abstract

2022

Accelerating Deep Learning Training Through Transparent Storage Tiering

Authors
Dantas, M; Leitao, D; Cui, P; Macedo, R; Liu, XL; Xu, WJ; Paulo, J;

Publication
2022 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2022)

Abstract
We present MONARCH, a framework-agnostic storage middleware that transparently employs storage tiering to accelerate Deep Learning (DL) training. It leverages existing storage tiers of modern supercomputers (i.e., compute node's local storage and shared parallel file system (PFS)), while considering the I/O patterns of DL frameworks to improve data placement across tiers. MONARCH aims at accelerating DL training and decreasing the I/O pressure imposed over the PFS. We apply MONARCH to TensorFlow and PyTorch, while validating its performance and applicability under different models and dataset sizes. Results show that, even when the training dataset can only be partially stored at local storage, MONARCH reduces TensorFlow's and PyTorch's training time by up to 28% and 37% for I/O-intensive models, respectively. Furthermore, MONARCH decreases the number of I/O operations submitted to the PFS by up to 56%.
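A minimal sketch of the tiering idea, assuming a simple capacity-bound promotion policy (the class and policy below are illustrative, not MONARCH's actual design): serve a training file from fast local storage when cached there, otherwise read it from the shared PFS and promote it locally while capacity remains.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Two-tier reader: fast local tier backed by a shared PFS tier.
class TieredReader {
 public:
  TieredReader(fs::path local, fs::path pfs, std::uintmax_t capacity)
      : local_(std::move(local)), pfs_(std::move(pfs)), capacity_(capacity) {}

  std::vector<char> read(const std::string& name) {
    const fs::path local_path = local_ / name;
    if (fs::exists(local_path)) return slurp(local_path);  // tier hit
    std::vector<char> data = slurp(pfs_ / name);            // miss: go to PFS
    if (used_ + data.size() <= capacity_) {                 // promote if it fits
      std::ofstream out(local_path, std::ios::binary);
      out.write(data.data(), static_cast<std::streamsize>(data.size()));
      used_ += data.size();
    }
    return data;
  }

 private:
  static std::vector<char> slurp(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    return {std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>()};
  }

  fs::path local_, pfs_;
  std::uintmax_t capacity_, used_ = 0;
};
```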

2022

PAIO: General, Portable I/O Optimizations With Minor Application Modifications

Authors
Macedo, R; Tanimura, Y; Haga, J; Chidambaram, V; Pereira, J; Paulo, J;

Publication
Proceedings of the 20th USENIX Conference on File and Storage Technologies (FAST 2022)

Abstract
We present PAIO, a framework that allows developers to implement portable I/O policies and optimizations for different applications with minor modifications to their original code base. The chief insight behind PAIO is that if we are able to intercept and differentiate requests as they flow through different layers of the I/O stack, we can enforce complex storage policies without significantly changing the layers themselves. PAIO adopts ideas from the Software-Defined Storage community, building data plane stages that mediate and optimize I/O requests across layers and a control plane that coordinates and fine-tunes stages according to different storage policies. We demonstrate the performance and applicability of PAIO with two use cases. The first improves 99th percentile latency by 4x in industry-standard LSM-based key-value stores. The second ensures dynamic per-application bandwidth guarantees under shared storage environments.
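One common way to embed such a mediation stage, sketched below under the assumption of an LD_PRELOAD shim (the classifier and policy here are hypothetical, not PAIO's API): intercept a POSIX call at a layer boundary, differentiate the request, enforce a policy, then forward it to the real libc implementation.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes RTLD_NEXT in <dlfcn.h>
#endif
#include <dlfcn.h>
#include <unistd.h>

// Toy classifier: differentiate request flows (here, by file descriptor).
static bool is_background_flow(int fd) { return fd >= 100; }

// Intercepted write(): classify the request, apply a policy, then forward
// it to the real libc write().
extern "C" ssize_t write(int fd, const void* buf, size_t count) {
  using write_fn = ssize_t (*)(int, const void*, size_t);
  static write_fn real_write =
      reinterpret_cast<write_fn>(dlsym(RTLD_NEXT, "write"));
  if (is_background_flow(fd)) {
    usleep(100);  // crude stand-in for a real policy (e.g., a token bucket)
  }
  return real_write(fd, buf, count);
}
```

Built as a shared object (e.g., g++ -shared -fPIC shim.cpp -o shim.so -ldl) and injected via LD_PRELOAD, a shim like this mediates every write() without modifying the application itself.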

2022

Protecting Metadata Servers From Harm Through Application-level I/O Control

Authors
Macedo, R; Miranda, M; Tanimura, Y; Haga, J; Ruhela, A; Harrell, SL; Evans, RT; Paulo, J;

Publication
2022 IEEE International Conference on Cluster Computing (CLUSTER 2022)

Abstract
Modern large-scale I/O applications that run on HPC infrastructures are increasingly becoming metadata-intensive. Unfortunately, having multiple concurrent applications submitting massive amounts of metadata operations can easily saturate the shared parallel file system's metadata resources, leading to unresponsiveness of the storage backend and overall performance degradation. To address these challenges, we present PADLL, a storage middleware that enables system administrators to proactively control and ensure QoS over metadata workflows in HPC storage systems. We demonstrate its performance and feasibility by controlling the rate of both synthetic and realistic I/O workloads. Results show that PADLL can dynamically control metadata-aggressive workloads, prevent I/O burstiness, and ensure I/O fairness and prioritization.
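Rate control of this kind is often implemented with a token bucket; the sketch below is an assumption about mechanism, not PADLL's actual code. Each tenant's metadata operations (open, stat, mkdir, ...) consume tokens that refill at an administrator-chosen rate, which smooths bursts and enforces per-tenant limits.

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Single-caller token bucket: one token per metadata operation, refilled
// at a fixed rate; the burst size is capped at one second's worth of tokens.
class TokenBucket {
 public:
  explicit TokenBucket(double ops_per_sec)
      : rate_(ops_per_sec), tokens_(ops_per_sec),
        last_(std::chrono::steady_clock::now()) {}

  // Block until one metadata operation may proceed.
  void acquire() {
    refill();
    while (tokens_ < 1.0) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
      refill();
    }
    tokens_ -= 1.0;
  }

 private:
  void refill() {
    const auto now = std::chrono::steady_clock::now();
    const std::chrono::duration<double> dt = now - last_;
    tokens_ = std::min(rate_, tokens_ + dt.count() * rate_);  // burst cap
    last_ = now;
  }

  double rate_, tokens_;
  std::chrono::steady_clock::time_point last_;
};
```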

2023

Diagnosing Applications' I/O Behavior Through System Call Observability

Authors
Esteves, T; Macedo, R; Oliveira, R; Paulo, J;

Publication
CoRR

Abstract
