Publications

Publications by Miguel Ângelo Guimarães

2022

A predictive and user-centric approach to Machine Learning in data streaming scenarios

Authors
Carneiro, D; Guimaraes, M; Silva, F; Novais, P;

Publication
NEUROCOMPUTING

Abstract
Machine Learning has emerged in recent years as the main solution to many of today's data-based decision problems. However, while new and more powerful algorithms and the increasing availability of computational resources have contributed to the widespread use of Machine Learning, significant challenges remain. Two of the most significant are the need to explain a model's predictions and the high cost of training and re-training models, especially with large datasets or in streaming scenarios. In this paper we address both issues by proposing an approach we deem predictive and user-centric. It is predictive in the sense that it estimates the benefit of re-training a model with new data, and it is user-centric in the sense that it implements an explainable interface that produces interpretable explanations to accompany predictions. The former reduces the resources (e.g. time, costs) spent on re-training models when no improvements are expected, while the latter gives human users additional information to support decision-making. We validate the proposed approach with a group of public datasets and present a real application scenario.


2021

A Conversational Interface for interacting with Machine Learning models

Authors
Carneiro, D; Veloso, P; Guimarães, M; Baptista, J; Sousa, M;

Publication
Proceedings of the 4th International Workshop on eXplainable and Responsible AI and Law, co-located with the 18th International Conference on Artificial Intelligence and Law (ICAIL 2021), Virtual Event, São Paulo, Brazil, June 21, 2021.

Abstract

2023

The Impact of Data Selection Strategies on Distributed Model Performance

Authors
Guimarães, M; Oliveira, F; Carneiro, D; Novais, P;

Publication
Lecture Notes in Networks and Systems

Abstract
Distributed Machine Learning, in which data and learning tasks are scattered across a cluster of computers, is one of the field's answers to the challenges posed by Big Data. Even in an era in which data abounds, decisions must still be made regarding which specific data to use for training the model, either because the amount of available data is simply too large, or because the training time or complexity of the model must be kept low. Typical approaches include, for example, selection based on data freshness. However, old data are not necessarily outdated and might still contain relevant patterns. Likewise, relying only on recent data may significantly decrease data diversity and representativeness, and thereby decrease model quality. The goal of this paper is to compare different heuristics for selecting data in a distributed Machine Learning scenario. Specifically, we ascertain whether selecting data based on their characteristics (meta-features), and optimizing for maximum diversity, improves model quality while possibly allowing a reduction in model complexity. This will support the development of more informed data selection strategies in distributed settings, in which the criteria are not only the location of the data or the state of each node in the cluster, but also include intrinsic and relevant characteristics of the data. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.


2023

Block size, parallelism and predictive performance: finding the sweet spot in distributed learning

Authors
Oliveira, F; Carneiro, D; Guimaraes, M; Oliveira, O; Novais, P;

Publication
INTERNATIONAL JOURNAL OF PARALLEL EMERGENT AND DISTRIBUTED SYSTEMS

Abstract
As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size.


2023

Predicting and explaining absenteeism risk in hospital patients before and during COVID-19

Authors
Borges, A; Carvalho, M; Maia, M; Guimaraes, M; Carneiro, D;

Publication
SOCIO-ECONOMIC PLANNING SCIENCES

Abstract
To address one of the most challenging problems in hospital management, patients' absenteeism without prior notice, this study analyses the risk factors associated with this event. To this end, using real data from a hospital located in the North of Portugal, a prediction model previously validated in the literature is used to infer absenteeism risk factors, and an explainable model is proposed, based on a modified CART algorithm. The latter is intended to generate a human-interpretable explanation for patient absenteeism, and its implementation is described in detail. Furthermore, given the significant impact the COVID-19 pandemic had on hospital management, the profiles of absent patients before and during the COVID-19 pandemic are compared. The results differ between hospital specialities and time periods, meaning that the profiles of absent patients change during pandemic periods and across specialities.


2023

Predicting Model Training Time to Optimize Distributed Machine Learning Applications

Authors
Guimaraes, M; Carneiro, D; Palumbo, G; Oliveira, F; Oliveira, O; Alves, V; Novais, P;

Publication
ELECTRONICS

Abstract
Despite major advances in recent years, the field of Machine Learning continues to face research and technical challenges. Mostly, these stem from big data and streaming data, which require models to be frequently updated or re-trained, at the expense of significant computational resources. One solution is the use of distributed learning algorithms, which can learn in a distributed manner, from distributed datasets. In this paper, we describe CEDEs, a distributed learning system in which models are heterogeneous distributed Ensembles, i.e., complex models constituted by different base models, trained with different and distributed subsets of data. Specifically, we address the issue of predicting the training time of a given model, given its characteristics and the characteristics of the data. Given that the creation of an Ensemble may imply the training of hundreds of base models, information about the predicted duration of each of these individual tasks is paramount for an efficient management of the cluster's computational resources and for minimizing makespan, i.e., the time it takes to train the whole Ensemble. Results show that the proposed approach is able to predict the training time of Decision Trees with an average error of 0.103 s, and the training time of Neural Networks with an average error of 21.263 s. We also show how results depend significantly on the hyperparameters of the model and on the characteristics of the input data.
