Publications

Publications by Ricardo Gonçalves Macedo

2026

MinatoLoader: Accelerating Machine Learning Training Through Efficient Data Preprocessing

Authors
Nouaji, R; Bitchebe, S; Macedo, R; Balmau, O;

Publication
EuroSys

Abstract
Machine learning (ML) frameworks, such as PyTorch and TensorFlow, rely on data loaders to preprocess data before feeding it to accelerators. When preprocessing is inefficiently pipelined, GPUs can remain idle for long periods, leading to substantial training delays. For example, PyTorch’s default data loaders can cause up to 76% GPU idleness. A key bottleneck is the variability in preprocessing time across samples within the same dataset. Existing data loaders are oblivious to this variability, treating all samples uniformly. In this case, a single slow sample can stall the entire batch, causing head-of-line blocking. We present MinatoLoader, a general-purpose data loader for PyTorch that accelerates training and improves GPU utilization under single-server, multi-GPU settings. It continuously prepares data in the background and constructs batches by prioritizing fast-to-process samples, while slower samples are processed in parallel. Experiments conducted over NVIDIA V100 and A100 GPUs show that MinatoLoader accelerates training by up to 7.5× (3.6× on average) over PyTorch DataLoader and Pecan, and up to 3× (2.2× on average) over DALI. It also increases average GPU utilization from 46% with PyTorch to 90%, while preserving model accuracy and enabling faster convergence.

2026

Holpaca: Holistic and Adaptable Cache Management for Shared Environments

Authors
Peixoto, JP; González, A; Bhimani, J; Rangaswami, R; Brito, C; Paulo, J; Macedo, R;

Publication
ICPE

Abstract
Modern data-intensive systems rely on in-memory caching to achieve high throughput and low latency. CacheLib, Meta's general-purpose caching engine, provides high performance and flexibility for building specialized caches for a variety of applications. However, despite its wide adoption in large-scale infrastructures, CacheLib's data management mechanisms exhibit inefficiencies in shared environments. In particular, its static and uncoordinated memory allocation leads to fragmented resource usage, unfair memory distribution, and degraded performance across tenants and instances. We present Holpaca, a general-purpose caching middleware that enables holistic and adaptable orchestration of shared caching environments. Holpaca introduces a shim data layer co-located with each cache instance and a centralized orchestrator with system-wide visibility, enabling global memory management and per-tenant QoS policies. Using production traces from Twitter, results show that, by continuously readjusting memory allocations based on workload dynamics, Holpaca achieves up to 3× higher throughput in multi-tenant settings and a 2.2× improvement in multi-instance settings over CacheLib's rigid built-in mechanisms.

2026

Idiosyncrasies of Programmable Caching Engines

Authors
Peixoto, JP; González, A; Bhimani, J; Rangaswami, R; Brito, C; Paulo, J; Macedo, R;

Publication
CoRR

Abstract