About

Filipe Oliveira completed his bachelor's degree in Informatics in 2021 and is currently finishing his master's degree at the Escola Superior de Tecnologia e Gestão of the Politécnico do Porto. His final undergraduate project was developed in the area of Machine Learning (ML), centred on the concept of Distributed Machine Learning. He is currently an Invited Assistant Professor at the same institution. A research enthusiast with a passion for Artificial Intelligence, he finds genuine satisfaction in exploring the advances and challenges of this constantly evolving field. Over the years, he has had the opportunity to contribute to knowledge in this domain, having written several scientific articles addressing key problems and innovative solutions in Machine Learning.

Topics of interest
Details

  • Name

    Filipe Vamonde Oliveira
  • Position

    Research Assistant
  • Since

    15 February 2023
Publications

2023

The Impact of Data Selection Strategies on Distributed Model Performance

Authors
Guimarães, M; Oliveira, F; Carneiro, D; Novais, P;

Publication
Lecture Notes in Networks and Systems

Abstract
Distributed Machine Learning, in which data and learning tasks are scattered across a cluster of computers, is one of the answers of the field to the challenges posed by Big Data. Still, in an era in which data abounds, decisions must still be made regarding which specific data to use in the training of the model, either because the amount of available data is simply too large, or because the training time or complexity of the model must be kept low. Typical approaches include, for example, selection based on data freshness. However, old data are not necessarily outdated and might still contain relevant patterns. Likewise, relying only on recent data may significantly decrease data diversity and representativeness, and decrease model quality. The goal of this paper is to compare different heuristics for selecting data in a distributed Machine Learning scenario. Specifically, we ascertain whether selecting data based on their characteristics (meta-features), and optimizing for maximum diversity, improves model quality while, eventually, allowing model complexity to be reduced. This will allow the development of more informed data selection strategies in distributed settings, in which the criteria are not only the location of the data or the state of each node in the cluster, but also include intrinsic and relevant characteristics of the data. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.
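As a rough illustration of the idea, the sketch below selects data blocks by summarising each one with a few meta-features and greedily maximising diversity in that meta-feature space. The specific meta-features and the max-min selection rule are assumptions made for illustration, not the exact heuristics evaluated in the paper.

    import numpy as np

    def block_meta_features(X: np.ndarray, y: np.ndarray) -> np.ndarray:
        """Summarise a data block with a few simple, illustrative meta-features."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        class_entropy = float(-(p * np.log2(p)).sum())
        return np.array([X.mean(), X.std(), class_entropy])

    def select_diverse_blocks(blocks, k):
        """Greedy max-min selection: pick k blocks whose meta-features are as
        spread out as possible, instead of simply picking the freshest blocks."""
        metas = [block_meta_features(X, y) for X, y in blocks]
        chosen = [0]  # start from an arbitrary block
        while len(chosen) < min(k, len(blocks)):
            best, best_dist = None, -1.0
            for i, m in enumerate(metas):
                if i in chosen:
                    continue
                d = min(np.linalg.norm(m - metas[j]) for j in chosen)
                if d > best_dist:  # candidate farthest from everything already selected
                    best, best_dist = i, d
            chosen.append(best)
        return chosen

    # blocks: list of (X, y) pairs scattered across the cluster;
    # the returned indices identify the blocks used to train base models.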

2023

Block size, parallelism and predictive performance: finding the sweet spot in distributed learning

Authors
Oliveira, F; Carneiro, D; Guimaraes, M; Oliveira, O; Novais, P;

Publication
International Journal of Parallel, Emergent and Distributed Systems

Abstract
As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size.
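A minimal sketch of the "fewer but better base models" idea: train one base model per fixed-size block, score each one, and keep only the strongest for the ensemble. The cross-validation scoring and the keep_fraction parameter are illustrative assumptions, not the exact heuristics compared in the paper.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def train_block_ensemble(blocks, keep_fraction=0.5):
        """Train one base model per data block, then keep only the best-scoring
        fraction of them (fewer but better base models)."""
        scored = []
        for X, y in blocks:
            score = cross_val_score(DecisionTreeClassifier(), X, y, cv=3).mean()
            model = DecisionTreeClassifier().fit(X, y)
            scored.append((score, model))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        n_keep = max(1, int(len(scored) * keep_fraction))
        return [model for _, model in scored[:n_keep]]

    def ensemble_predict(base_models, X):
        """Majority vote over the base models (assumes integer class labels)."""
        votes = np.array([m.predict(X) for m in base_models])
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)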

2023

Predicting Model Training Time to Optimize Distributed Machine Learning Applications

Authors
Guimaraes, M; Carneiro, D; Palumbo, G; Oliveira, F; Oliveira, O; Alves, V; Novais, P;

Publication
Electronics

Abstract
Despite major advances in recent years, the field of Machine Learning continues to face research and technical challenges. Mostly, these stem from big data and streaming data, which require models to be frequently updated or re-trained, at the expense of significant computational resources. One solution is the use of distributed learning algorithms, which can learn in a distributed manner, from distributed datasets. In this paper, we describe CEDEs-a distributed learning system in which models are heterogeneous distributed Ensembles, i.e., complex models constituted by different base models, trained with different and distributed subsets of data. Specifically, we address the issue of predicting the training time of a given model, given its characteristics and the characteristics of the data. Given that the creation of an Ensemble may imply the training of hundreds of base models, information about the predicted duration of each of these individual tasks is paramount for an efficient management of the cluster's computational resources and for minimizing makespan, i.e., the time it takes to train the whole Ensemble. Results show that the proposed approach is able to predict the training time of Decision Trees with an average error of 0.103 s, and the training time of Neural Networks with an average error of 21.263 s. We also show how results depend significantly on the hyperparameters of the model and on the characteristics of the input data.
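The core of such an approach can be illustrated with a small meta-model: a regressor that maps characteristics of a training task (dataset size, dimensionality, model hyperparameters) to an estimated duration, which a scheduler can then use to order tasks and reduce makespan. The feature set and the numbers below are purely hypothetical placeholders, not the paper's meta-features or results.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Hypothetical meta-dataset: each row describes a past training task as
    # (n_rows, n_columns, max_depth); the target is the observed training time (s).
    meta_X = np.array([
        [10_000, 20, 8],
        [50_000, 20, 12],
        [100_000, 50, 12],
        [200_000, 50, 16],
    ])
    meta_y = np.array([0.4, 2.1, 6.8, 19.5])  # illustrative values only

    time_model = RandomForestRegressor(n_estimators=100, random_state=0)
    time_model.fit(meta_X, meta_y)

    # Before scheduling a new base-model training task, estimate its duration
    # and use the estimate to prioritise tasks across the cluster.
    estimated_seconds = time_model.predict(np.array([[80_000, 50, 12]]))[0]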

2023

Dynamic Management of Distributed Machine Learning Projects

Authors
Oliveira, F; Alves, A; Moço, H; Monteiro, J; Oliveira, Ó; Carneiro, D; Novais, P;

Publication
Studies in Computational Intelligence

Abstract
Given the new requirements of Machine Learning problems in the last years, especially in what concerns the volume, diversity and speed of data, new approaches are needed to deal with the associated challenges. In this paper we describe CEDEs - a distributed learning system that runs on top of a Hadoop cluster and takes advantage of blocks, replication and balancing. CEDEs trains models in a distributed manner following the principle of data locality, and is able to change parts of the model through an optimization module, thus allowing a model to evolve over time as the data changes. This paper describes its generic architecture, details the implementation of the first modules, and provides a first validation. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
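One way to picture the data-locality principle is the toy assignment routine below: every data block is trained on a node that already holds one of its replicas, choosing the least-loaded replica holder so that no training data crosses the network. The class and function names are assumptions made for illustration and do not reflect CEDEs' actual interfaces.

    from dataclasses import dataclass, field

    @dataclass
    class Block:
        block_id: str
        replica_nodes: list  # names of the nodes holding a replica of this block

    @dataclass
    class Node:
        name: str
        queued_tasks: list = field(default_factory=list)

    def assign_training_tasks(blocks, nodes):
        """Assign each block's training task to the least-loaded node that
        already stores a replica of that block (data locality)."""
        by_name = {n.name: n for n in nodes}
        for block in blocks:
            candidates = [by_name[r] for r in block.replica_nodes if r in by_name]
            target = min(candidates, key=lambda n: len(n.queued_tasks))
            target.queued_tasks.append(block.block_id)
        return nodes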

2021

A Data-Locality-Aware Distributed Learning System

Authors
Carneiro, D; Oliveira, F; Novais, P;

Publication
Ambient Intelligence - Software and Applications - 12th International Symposium on Ambient Intelligence, ISAmI 2021, Salamanca, Spain, 6-8 October, 2021.

Abstract
Machine Learning problems are significantly growing in complexity, either due to an increase in the volume of data, to new forms of data, or due to the change of data over time. This poses new challenges that are both technical and scientific. In this paper we propose a Distributed Learning System that runs on top of a Hadoop cluster, leveraging its native functionalities. It is guided by the principle of data locality. Data are distributed across the cluster, so models are also distributed and trained in parallel. Models are thus seen as Ensembles of base models, and predictions are made by combining the predictions of the base models. Moreover, models are replicated and distributed across the cluster, so that multiple nodes can answer requests. This results in a system that is both resilient and with high availability. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
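The availability argument can be sketched as follows: because base models are replicated, any live node holding the replicas can answer a prediction request by combining its local base models. The names and the majority-vote combination rule are illustrative assumptions rather than the system's actual API.

    import random

    class ServingNode:
        def __init__(self, name, base_models):
            self.name = name
            self.base_models = base_models  # local replicas of the ensemble's base models
            self.alive = True

        def predict(self, x):
            votes = [m.predict([x])[0] for m in self.base_models]
            return max(set(votes), key=votes.count)  # majority vote

    def answer_request(nodes, x):
        """Route the request to any live node; the system keeps answering
        as long as at least one replica-holding node is up."""
        live = [n for n in nodes if n.alive]
        if not live:
            raise RuntimeError("no serving nodes available")
        return random.choice(live).predict(x)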