2023
Authors
Palumbo, G; Carneiro, D; Guimarães, M; Alves, V; Novais, P;
Publication
INTERNATIONAL JOURNAL OF NEURAL SYSTEMS
Abstract
In recent years, the number of machine learning algorithms and their parameters has increased significantly. On the one hand, this increases the chances of finding better models. On the other hand, it increases the complexity of the task of training a model, as the search space expands significantly. As the size of datasets also grows, traditional approaches based on extensive search become prohibitively expensive in terms of computational resources and time, especially in data streaming scenarios. This paper describes an approach based on meta-learning that tackles two main challenges. The first is to predict key performance indicators of machine learning models. The second is to recommend the best algorithm/configuration for training a model for a given machine learning problem. When compared to a state-of-the-art method (AutoML), the proposed approach is up to 130x faster and only 4% worse in terms of average model quality. Hence, it is especially suited for scenarios in which models need to be updated regularly, such as streaming scenarios with big data, in which some accuracy can be traded for a much shorter model training time.
2022
Authors
Carneiro, D; Guimaraes, M; Silva, F; Novais, P;
Publication
NEUROCOMPUTING
Abstract
Machine Learning has emerged in recent years as the main solution to many of today's data-based decision problems. However, while new and more powerful algorithms and the increasing availability of computational resources have contributed to the widespread use of Machine Learning, significant challenges remain. Two of the most significant today are the need to explain a model's predictions and the significant costs of training and re-training models, especially with large datasets or in streaming scenarios. In this paper we address both issues by proposing an approach we deem predictive and user-centric. It is predictive in the sense that it estimates the benefit of re-training a model with new data, and it is user-centric in the sense that it implements an explainable interface that produces interpretable explanations that accompany predictions. The former reduces the resources (e.g. time, costs) spent on re-training models when no improvements are expected, while the latter gives human users additional information to support decision-making. We validate the proposed approach with a group of public datasets and present a real application scenario.
2021
Authors
Carneiro, D; Veloso, P; Guimarães, M; Baptista, J; Sousa, M;
Publication
Proceedings of the 4th International Workshop on eXplainable and Responsible AI and Law co-located with the 18th International Conference on Artificial Intelligence and Law (ICAIL 2021), Virtual Event, São Paulo, Brazil, June 21, 2021.
Abstract
2023
Authors
Guimarães, M; Oliveira, F; Carneiro, D; Novais, P;
Publication
Ambient Intelligence - Software and Applications - 14th International Symposium on Ambient Intelligence, ISAmI 2023, Guimarães, Portugal, July 12-14, 2023
Abstract
Distributed Machine Learning, in which data and learning tasks are scattered across a cluster of computers, is one of the answers of the field to the challenges posed by Big Data. Yet, in an era in which data abounds, decisions must still be made regarding which specific data to use in training the model, either because the amount of available data is simply too large, or because the training time or complexity of the model must be kept low. Typical approaches include, for example, selection based on data freshness. However, old data are not necessarily outdated and might still contain relevant patterns. Likewise, relying only on recent data may significantly decrease data diversity and representativity, and decrease model quality. The goal of this paper is to compare different heuristics for selecting data in a distributed Machine Learning scenario. Specifically, we ascertain whether selecting data based on their characteristics (meta-features), and optimizing for maximum diversity, improves model quality while, eventually, allowing model complexity to be reduced. This will allow the development of more informed data selection strategies in distributed settings, in which the criteria are not only the location of the data or the state of each node in the cluster, but also include intrinsic and relevant characteristics of the data.
2024
Authors
Oliveira, F; Carneiro, D; Guimaraes, M; Oliveira, O; Novais, P;
Publication
INTERNATIONAL JOURNAL OF PARALLEL EMERGENT AND DISTRIBUTED SYSTEMS
Abstract
As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size.
2023
Authors
Borges, A; Carvalho, M; Maia, M; Guimaraes, M; Carneiro, D;
Publication
SOCIO-ECONOMIC PLANNING SCIENCES
Abstract
In order to address one of the most challenging problems in hospital management - patients' absenteeism without prior notice - this study analyses the risk factors associated with this event. To this end, using real data from a hospital located in the North of Portugal, a prediction model previously validated in the literature is used to infer absenteeism risk factors, and an explainable model is proposed, based on a modified CART algorithm. The latter is intended to generate a human-interpretable explanation for patient absenteeism, and its implementation is described in detail. Furthermore, given the significant impact the COVID-19 pandemic had on hospital management, a comparison between patients' profiles upon absenteeism before and during the COVID-19 pandemic is performed. The results obtained differ between hospital specialities and time periods, meaning that patient profiles on absenteeism change during pandemic periods and within specialities.