
About

Filipe Oliveira completed his Bachelor's Degree in Computer Science in 2021 and is currently finishing his Master's Degree at the School of Management and Technology of the Polytechnic of Porto. His bachelor's final project was developed in the field of Machine Learning (ML), centred on the concept of Distributed Machine Learning. He is currently an Invited Assistant Professor at the same institution. A research enthusiast who is passionate about the fascinating field of Artificial Intelligence, he finds real satisfaction in exploring the advances and challenges of this ever-evolving area. Over the past few years, he has had the opportunity to contribute to knowledge in this domain, having written several scientific articles that address crucial issues and innovative solutions in Machine Learning.


Details

  • Name

    Filipe Vamonde Oliveira
  • Role

    Research Assistant
  • Since

    15th February 2023
Publications

2023

The Impact of Data Selection Strategies on Distributed Model Performance

Authors
Guimarães, M.; Oliveira, F.; Carneiro, D.; Novais, P.

Publication
Lecture Notes in Networks and Systems

Abstract
Distributed Machine Learning, in which data and learning tasks are scattered across a cluster of computers, is one of the field's answers to the challenges posed by Big Data. Still, in an era in which data abound, decisions must still be made regarding which specific data to use in the training of the model, either because the amount of available data is simply too large, or because the training time or complexity of the model must be kept low. Typical approaches include, for example, selection based on data freshness. However, old data are not necessarily outdated and might still contain relevant patterns. Likewise, relying only on recent data may significantly decrease data diversity and representativity, and decrease model quality. The goal of this paper is to compare different heuristics for selecting data in a distributed Machine Learning scenario. Specifically, we ascertain whether selecting data based on their characteristics (meta-features), and optimizing for maximum diversity, improves model quality while potentially allowing a reduction in model complexity. This will allow the development of more informed data selection strategies in distributed settings, in which the criteria are not only the location of the data or the state of each node in the cluster, but also include intrinsic and relevant characteristics of the data. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.
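The paper's own heuristics are not reproduced here, but the general idea of meta-feature-based, diversity-maximizing selection can be sketched as follows. The block contents, the choice of meta-features, and the greedy max-min criterion are all illustrative assumptions, not the method evaluated in the paper:

```python
import math

def meta_features(block):
    # Describe a data block by simple meta-features: mean and standard deviation.
    n = len(block)
    mean = sum(block) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in block) / n)
    return (mean, std)

def select_diverse(blocks, k):
    # Greedy max-min selection: start from the first block, then repeatedly
    # add the block whose meta-features are farthest from all chosen blocks.
    names = list(blocks)
    feats = {name: meta_features(blocks[name]) for name in names}
    chosen = [names[0]]
    while len(chosen) < k:
        best = max(
            (n for n in names if n not in chosen),
            key=lambda n: min(math.dist(feats[n], feats[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen

# Toy usage: the third block is statistically very different from the first,
# so a diversity-driven selection picks it over the near-duplicate second block.
blocks = {"b1": [1, 2, 3], "b2": [1.1, 2.1, 3.1], "b3": [10, 20, 30]}
```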

2023

Block size, parallelism and predictive performance: finding the sweet spot in distributed learning

Authors
Oliveira, F.; Carneiro, D.; Guimarães, M.; Oliveira, Ó.; Novais, P.

Publication
International Journal of Parallel, Emergent and Distributed Systems

Abstract
As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size.

2023

Predicting Model Training Time to Optimize Distributed Machine Learning Applications

Authors
Guimarães, M.; Carneiro, D.; Palumbo, G.; Oliveira, F.; Oliveira, Ó.; Alves, V.; Novais, P.

Publication
Electronics

Abstract
Despite major advances in recent years, the field of Machine Learning continues to face research and technical challenges. Mostly, these stem from big data and streaming data, which require models to be frequently updated or re-trained, at the expense of significant computational resources. One solution is the use of distributed learning algorithms, which can learn in a distributed manner, from distributed datasets. In this paper, we describe CEDEs, a distributed learning system in which models are heterogeneous distributed Ensembles, i.e., complex models constituted by different base models trained with different and distributed subsets of data. Specifically, we address the issue of predicting the training time of a given model, given its characteristics and the characteristics of the data. Given that the creation of an Ensemble may imply the training of hundreds of base models, information about the predicted duration of each of these individual tasks is paramount for efficient management of the cluster's computational resources and for minimizing makespan, i.e., the time it takes to train the whole Ensemble. Results show that the proposed approach is able to predict the training time of Decision Trees with an average error of 0.103 s, and the training time of Neural Networks with an average error of 21.263 s. We also show how results depend significantly on the hyperparameters of the model and on the characteristics of the input data.
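As a loose illustration of the idea of a training-time predictor, a meta-model can be fitted on previously observed (task characteristics, duration) pairs and then queried before scheduling. The single workload feature, the synthetic history, and the linear model below are assumptions for this sketch, not the predictor used in CEDEs:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b with a single feature.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical training history: a workload measure for each past base-model
# training task (e.g. rows * tree depth) and the seconds it actually took.
history_workload = [1_000, 2_000, 4_000, 8_000]
history_seconds = [0.5, 1.0, 2.0, 4.0]
a, b = fit_line(history_workload, history_seconds)

def predict_seconds(workload):
    # Estimated duration of a new training task, used to plan cluster resources.
    return a * workload + b
```

A scheduler could then sum such estimates over all base models of an Ensemble to approximate the makespan before committing resources.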

2023

Dynamic Management of Distributed Machine Learning Projects

Authors
Oliveira, F.; Alves, A.; Moço, H.; Monteiro, J.; Oliveira, Ó.; Carneiro, D.; Novais, P.

Publication
Studies in Computational Intelligence

Abstract
Given the new requirements of Machine Learning problems in recent years, especially concerning the volume, diversity and speed of data, new approaches are needed to deal with the associated challenges. In this paper we describe CEDEs, a distributed learning system that runs on top of a Hadoop cluster and takes advantage of blocks, replication and balancing. CEDEs trains models in a distributed manner following the principle of data locality, and is able to change parts of the model through an optimization module, thus allowing a model to evolve over time as the data changes. This paper describes its generic architecture, details the implementation of the first modules, and provides a first validation. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

2021

A Data-Locality-Aware Distributed Learning System

Authors
Carneiro, D.; Oliveira, F.; Novais, P.

Publication
Ambient Intelligence - Software and Applications - 12th International Symposium on Ambient Intelligence, ISAmI 2021, Salamanca, Spain, 6-8 October, 2021.

Abstract
Machine Learning problems are significantly growing in complexity, either due to an increase in the volume of data, to new forms of data, or due to the change of data over time. This poses new challenges that are both technical and scientific. In this paper we propose a Distributed Learning System that runs on top of a Hadoop cluster, leveraging its native functionalities. It is guided by the principle of data locality. Data are distributed across the cluster, so models are also distributed and trained in parallel. Models are thus seen as Ensembles of base models, and predictions are made by combining the predictions of the base models. Moreover, models are replicated and distributed across the cluster, so that multiple nodes can answer requests. This results in a system that is both resilient and with high availability. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
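The prediction scheme described above, in which the system answers a request by combining the predictions of distributed base models, can be illustrated with a minimal majority-vote sketch. The toy base models and their thresholds are hypothetical stand-ins for models trained on different data partitions:

```python
from collections import Counter

def ensemble_predict(base_models, x):
    # Combine base-model predictions by majority vote (classification case).
    votes = Counter(model(x) for model in base_models)
    return votes.most_common(1)[0][0]

# Toy base models, each standing in for a model trained on a different
# partition of the distributed data, hence the slightly different thresholds.
base_models = [
    lambda x: "spam" if x > 0.5 else "ham",
    lambda x: "spam" if x > 0.7 else "ham",
    lambda x: "spam" if x > 0.3 else "ham",
]
```

For regression, the combination step would average base-model outputs instead of voting; either way, any replica holding the base models can serve the request, which is what gives the system its availability.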