Publications

Publications by Miguel Ângelo Guimarães

2023

Predicting Model Training Time to Optimize Distributed Machine Learning Applications

Authors
Guimaraes, M; Carneiro, D; Palumbo, G; Oliveira, F; Oliveira, O; Alves, V; Novais, P;

Publication
ELECTRONICS

Abstract
Despite major advances in recent years, the field of Machine Learning continues to face research and technical challenges. Mostly, these stem from big data and streaming data, which require models to be frequently updated or re-trained, at the expense of significant computational resources. One solution is the use of distributed learning algorithms, which can learn in a distributed manner, from distributed datasets. In this paper, we describe CEDEs, a distributed learning system in which models are heterogeneous distributed Ensembles, i.e., complex models constituted by different base models, trained with different and distributed subsets of data. Specifically, we address the issue of predicting the training time of a given model, given its characteristics and the characteristics of the data. Given that the creation of an Ensemble may imply the training of hundreds of base models, information about the predicted duration of each of these individual tasks is paramount for an efficient management of the cluster's computational resources and for minimizing makespan, i.e., the time it takes to train the whole Ensemble. Results show that the proposed approach is able to predict the training time of Decision Trees with an average error of 0.103 s, and the training time of Neural Networks with an average error of 21.263 s. We also show how results depend significantly on the hyperparameters of the model and on the characteristics of the input data.
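As an illustration of why per-task duration predictions matter for makespan, the following is a minimal sketch of a greedy longest-processing-time-first scheduler. It is not the scheduling method of CEDEs, which the abstract does not describe; the function name `schedule_lpt`, the task names, and the predicted durations are all hypothetical.

```python
import heapq


def schedule_lpt(predicted_seconds, n_workers):
    """Assign tasks to workers greedily, longest predicted task first.

    predicted_seconds: dict mapping a task name to its predicted training time.
    Returns (assignment, makespan), where assignment maps worker index to tasks.
    """
    # Min-heap of (accumulated load, worker index): least-loaded worker on top.
    loads = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(loads)
    assignment = {w: [] for w in range(n_workers)}

    # Placing the longest tasks first tends to balance the final loads.
    ordered = sorted(predicted_seconds.items(), key=lambda kv: kv[1], reverse=True)
    for task, seconds in ordered:
        load, worker = heapq.heappop(loads)
        assignment[worker].append(task)
        heapq.heappush(loads, (load + seconds, worker))

    # The makespan is the load of the busiest worker.
    makespan = max(load for load, _ in loads)
    return assignment, makespan


# Hypothetical predicted training times, in seconds, for five base models.
predicted = {"tree_1": 0.4, "tree_2": 0.3, "nn_1": 20.0, "nn_2": 18.0, "nn_3": 5.0}
assignment, makespan = schedule_lpt(predicted, n_workers=2)
```

With accurate predictions, a scheduler of this kind can keep workers evenly loaded; with poor predictions, the longest tasks may end up stacked on the same worker.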

2023

Using meta-learning to predict performance metrics in machine learning problems

Authors
Carneiro, D; Guimaraes, M; Carvalho, M; Novais, P;

Publication
EXPERT SYSTEMS

Abstract
Machine learning has been facing significant challenges over the last years, many of which stem from the new characteristics of machine learning problems, such as learning from streaming data or incorporating human feedback into existing datasets and models. In these dynamic scenarios, data change over time and models must adapt. However, new data do not necessarily mean new patterns. The main goal of this paper is to devise a method to predict a model's performance metrics before it is trained, in order to decide whether it is worth training it or not. That is, will the model yield significantly better results than the current one? To address this issue, we propose the use of meta-learning. Specifically, we evaluate two different meta-models, one built for a specific machine learning problem, and another built based on many different problems, meant to be a generic meta-model, applicable to virtually any problem. In this paper, we focus only on the prediction of the root mean square error (RMSE). Results show that it is possible to accurately predict the RMSE of future models, even in streaming scenarios. Moreover, results also show that it is possible to reduce the need for re-training models by between 60% and 98%, depending on the problem and on the threshold used.
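The threshold-based decision mentioned in the abstract could look roughly like the sketch below. This is an assumption-laden illustration, not the paper's procedure: the function name `should_retrain`, the use of a relative RMSE improvement, and the default threshold of 5% are all hypothetical. The predicted RMSE is assumed to come from a meta-model, which is not shown here.

```python
def should_retrain(current_rmse, predicted_rmse, min_relative_gain=0.05):
    """Return True if the predicted RMSE improves enough to justify re-training.

    current_rmse: RMSE of the model currently in use.
    predicted_rmse: RMSE a meta-model predicts for a model trained on the new data.
    min_relative_gain: hypothetical threshold on the relative RMSE reduction.
    """
    # With a non-positive current error there is nothing left to improve.
    if current_rmse <= 0:
        return False
    relative_gain = (current_rmse - predicted_rmse) / current_rmse
    return relative_gain >= min_relative_gain


# Hypothetical values: a predicted 10% reduction triggers re-training,
# a predicted 2% reduction does not.
retrain_large_gain = should_retrain(10.0, 9.0)
retrain_small_gain = should_retrain(10.0, 9.8)
```

A higher threshold skips more re-training runs at the risk of keeping a stale model for longer, which is consistent with the abstract's remark that the savings depend on the threshold used.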

2021

A Meta-Learning Approach to Error Prediction

Authors
Guimaraes, M; Carneiro, D;

Publication
PROCEEDINGS OF 2021 16TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI'2021)

Abstract
Machine Learning is one of the most trending topics nowadays, because it is increasingly present in our everyday life, even if we do not notice it. What goes even more unnoticed is the fact that every Machine Learning model needs computational power. And of course, it also needs data. But how much data is necessary to build the best Machine Learning model possible, and how many times do we need to retrain a model so that it does not become obsolete as data change? Answering these kinds of questions can reduce unnecessary costs to a company. In this paper we propose a novel approach to predict the performance of a model given some characteristics of the data, called meta-features. The goal is to train a new model only when some error metric (e.g., RMSE) is expected to decrease substantially compared with a previously trained model. This approach is best applied in scenarios of data streaming or Big Data, as well as in Interactive Machine Learning scenarios. We validate it on a real Fraud Detection case, which is also briefly described.

2021

Optimizing Model Training in Interactive Learning Scenarios

Authors
Carneiro, D; Guimarães, M; Carvalho, M; Novais, P;

Publication
Trends and Applications in Information Systems and Technologies - Volume 1, WorldCIST 2021, Terceira Island, Azores, Portugal, 30 March - 2 April, 2021.

Abstract
In recent years, developments in data collection, storage, processing and analysis technologies have resulted in an unprecedented use of data by organizations. The volume and variety of data, combined with the velocity at which decisions must now be taken and the dynamism of business environments, pose new challenges to Machine Learning. Namely, algorithms must now deal with streaming data, concept drift, and distributed datasets, among others. One common task nowadays is to update or re-train models when data change, as opposed to traditional one-shot batch systems, in which the model is trained only once. This paper addresses the issue of when to update or re-train a model, by proposing an approach to predict the performance metrics of the model if it were trained at a given moment, with a specific set of data. We validate the proposed approach in an interactive Machine Learning system in the domain of fraud detection. © 2021, The Author(s), under exclusive license to Springer Nature Switzerland AG.

2020

Optimizing Instance Selection Strategies in Interactive Machine Learning: An Application to Fraud Detection

Authors
Carneiro, D; Guimarães, M; Sousa, M;

Publication
Hybrid Intelligent Systems - 20th International Conference on Hybrid Intelligent Systems (HIS 2020), Virtual Event, India, December 14-16, 2020

Abstract
Machine Learning systems are generally thought of as fully automatic. However, in recent years, interactive systems in which Human experts actively contribute towards the learning process have shown improved performance when compared to fully automated ones. This may be so in scenarios of Big Data, scenarios in which the input is a data stream, or when there is concept drift. In this paper we present a system for supporting auditors in the task of financial fraud detection. The system is interactive in the sense that the auditors can provide feedback regarding the instances of the data they use, or even suggest new variables. This feedback is incorporated into newly trained Machine Learning models which improve over time. In this paper we show that the order in which instances are evaluated by the auditors, and their feedback incorporated, influences the evolution of the performance of the system over time. The goal of this paper is to study how different instance selection strategies for Human evaluation and feedback can improve the learning speed. This information can then be used by the system to determine, at each moment, which instances would improve the system the most, so that these can be suggested to the users for validation. © 2021, The Author(s), under exclusive license to Springer Nature Switzerland AG.

2022

Continuously Learning from User Feedback

Authors
Carneiro, D; Sousa, M; Palumbo, G; Guimaraes, M; Carvalho, M; Novais, P;

Publication
INFORMATION SYSTEMS AND TECHNOLOGIES, WORLDCIST 2022, VOL 1

Abstract
Machine Learning has been evolving rapidly over the past years, with new algorithms and approaches being devised to solve the challenges that the new properties of data pose. Specifically, algorithms must now learn continuously and in real time, from very large and possibly distributed sets of data. In this paper we describe a learning system that tackles some of these novel challenges. It learns and adapts in real time by continuously incorporating user feedback, in a fully autonomous way. Moreover, it allows users to manage features (e.g. add, edit, remove), reflecting these changes on-the-fly in the Machine Learning pipeline. The paper describes some of the main functionalities of the system, which, despite being general-purpose, is being developed in the context of a project in the domain of financial fraud detection.
