Publications - INESC TEC

Publications by Carlos Manuel Soares

2024

Machine Learning Data Market Based on Multiagent Systems

Authors
Baghcheband, H; Soares, C; Reis, LP

Publication
IEEE INTERNET COMPUTING

Abstract
Today, autonomous agents, the Internet of Things, and smart devices produce more and more distributed data and use them to learn models for different purposes. One challenge is that learning from local data only may lead to suboptimal models. Thus, better models are expected if agents can exchange data, leading to approaches such as federated learning. However, these approaches assume that data have no value and, thus, are exchanged for free. A machine learning data market (MLDM), a framework based on multiagent systems with a market-based perspective on data exchange, was recently proposed. In an MLDM, each agent trains its model based on both local data and data bought from other agents. Although the empirical results are interesting, several challenges are still open, including data acquisition and data valuation. The MLDM is an illustrative example of how the value of data can and should be integrated into the design of distributed ML systems.

2024

Corrector LSTM: built-in training data correction for improved time-series forecasting

Authors
Baghoussi, Y; Soares, C; Moreira, JM

Publication
Neural Computing and Applications

Abstract
Traditional recurrent neural networks (RNNs) are essential for processing time-series data. However, they function as read-only models, lacking the ability to directly modify the data they learn from. In this study, we introduce the corrector long short-term memory (cLSTM), a Read & Write LSTM architecture that not only learns from the data but also dynamically adjusts it when necessary. The cLSTM model leverages two key components: (a) predicting LSTM’s cell states using Seasonal Autoregressive Integrated Moving Average (SARIMA) and (b) refining the training data based on discrepancies between actual and forecasted cell states. Our empirical validation demonstrates that cLSTM surpasses read-only LSTM models in forecasting accuracy across the Numenta Anomaly Benchmark (NAB) and M4 Competition datasets. Additionally, cLSTM exhibits superior performance in anomaly detection compared to hierarchical temporal memory (HTM) models.

2024

Shapley-Based Data Valuation Method for the Machine Learning Data Markets (MLDM)

Authors
Baghcheband, H; Soares, C; Reis, LP

Publication
FOUNDATIONS OF INTELLIGENT SYSTEMS, ISMIS 2024

Abstract
Data valuation, the process of assigning value to data based on its utility and usefulness, is a critical and largely unexplored aspect of data markets. Within the Machine Learning Data Market (MLDM), a platform that enables data exchange among multiple agents, the challenge of quantifying the value of data becomes particularly prominent. Agents within MLDM are motivated to exchange data based on its potential impact on their individual performance. Shapley Value-based methods have gained traction in addressing this challenge, prompting our study to investigate their effectiveness within the MLDM context. Specifically, we propose the Gain Data Shapley Value (GDSV) method tailored for MLDM and compare it to the original data valuation method used in MLDM. Our analysis focuses on two common learning algorithms, Decision Tree (DT) and K-nearest neighbors (KNN), within a simulated society of five agents, tested on 45 classification datasets. Results show that the GDSV leads to incremental improvements in predictive performance across both DT and KNN algorithms compared to performance-based valuation or the baseline. These findings underscore the potential of Shapley Value-based methods in identifying high-value data within MLDM while indicating areas for further improvement.
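
For readers unfamiliar with the Shapley value that the abstract above builds on, the following is a minimal sketch of the generic, exact computation over a handful of data blocks. It is not the GDSV method itself, whose details the abstract does not give. The function names, the block labels "a", "b", "c", and the toy utility (a made-up accuracy score) are all assumptions for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player for a coalition utility function.

    `utility` maps a frozenset of players to a number (for example, the
    accuracy of a model trained on those players' data).
    """
    n = len(players)
    values = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of coalitions of size k in the Shapley average.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal gain from adding p to the coalition.
                values[p] += weight * (utility(s | {p}) - utility(s))
    return values


def toy_utility(coalition):
    """Hypothetical accuracy obtained from a set of data blocks."""
    gains = {"a": 0.10, "b": 0.05, "c": 0.0}
    score = 0.5 + sum(gains[p] for p in coalition)
    # Assumed redundancy: blocks a and b overlap, so together they add less.
    if "a" in coalition and "b" in coalition:
        score -= 0.02
    return score


values = shapley_values(["a", "b", "c"], toy_utility)
```

The exact computation enumerates every coalition, so it is only practical for a few players; with more data sources, sampling-based approximations are the usual choice.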

2024

Kernel Corrector LSTM

Authors
Tuna, R; Baghoussi, Y; Soares, C; Mendes Moreira, J

Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS XXII, PT II, IDA 2024

Abstract
Forecasting methods are affected by data quality issues in two ways: (1) they are hard to predict, and (2) they may affect the model negatively when it is updated with new data. The latter problem is usually addressed by pre-processing the data to remove those issues. An alternative approach has recently been proposed, Corrector LSTM (cLSTM), which is a Read & Write Machine Learning (RW-ML) algorithm that changes the data while learning to improve its predictions. Despite the promising results reported, cLSTM is computationally expensive, as it uses a meta-learner to monitor the hidden states of the LSTM. We propose a new RW-ML algorithm, Kernel Corrector LSTM (KcLSTM), that replaces the meta-learner of cLSTM with a simpler method: Kernel Smoothing. We empirically evaluate the forecasting accuracy and the training time of the new algorithm and compare it with cLSTM and LSTM. Results indicate that it is able to decrease the training time while maintaining a competitive forecasting accuracy.
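
As a rough illustration of the kernel smoothing mentioned in the abstract above, the sketch below applies a Gaussian-kernel weighted average (Nadaraya-Watson style) to a plain numeric sequence. How KcLSTM applies smoothing to LSTM hidden states is not described in the abstract, so the function name, the bandwidth parameter, and the toy series are assumptions.

```python
import math


def kernel_smooth(series, bandwidth=1.0):
    """Smooth a sequence with a Gaussian-kernel weighted average."""
    n = len(series)
    smoothed = []
    for t in range(n):
        # Points closer in time to t receive larger weights.
        weights = [math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * y for w, y in zip(weights, series)) / total)
    return smoothed


# Hypothetical series with one spike at index 3.
noisy = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]
smooth = kernel_smooth(noisy, bandwidth=1.0)
```

A larger bandwidth spreads the spike over more neighbours and flattens it further; a smaller one leaves the series closer to the original.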

2021

Preface

Authors
Soares, C; Torgo, L

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

2025

Time Series Data Augmentation as an Imbalanced Learning Problem

Authors
Cerqueira, V; Moniz, N; Inácio, R; Soares, C

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2024, PT II

Abstract
Recent state-of-the-art forecasting methods are trained on collections of time series. These methods, often referred to as global models, can capture common patterns in different time series to improve their generalization performance. However, they require large amounts of data that might not be available. Moreover, global models may fail to capture relevant patterns unique to a particular time series. In these cases, data augmentation can be useful to increase the sample size of time series datasets. The main contribution of this work is a novel method for generating univariate time series synthetic samples. Our approach stems from the insight that the observations concerning a particular time series of interest represent only a small fraction of all observations. In this context, we frame the problem of training a forecasting model as an imbalanced learning task. Oversampling strategies are popular approaches used to handle the imbalance problem in machine learning. We use these techniques to create synthetic time series observations and improve the accuracy of forecasting models. We carried out experiments using 7 different databases that contain a total of 5502 univariate time series. We found that the proposed solution outperforms both a global and a local model, thus providing a better trade-off between these two approaches.
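
The idea in the abstract above of oversampling the observations of one series of interest can be sketched as follows. The abstract does not name the oversampling technique used, so this assumes a simple SMOTE-style interpolation between lag-embedded rows; the function names, the lag count, and the toy series are hypothetical.

```python
import random


def embed(series, n_lags):
    """Turn a series into rows of n_lags inputs followed by the next value."""
    return [series[i:i + n_lags + 1] for i in range(len(series) - n_lags)]


def interpolate_oversample(rows, n_new, rng):
    """Create synthetic rows by interpolating towards the nearest other row."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        # Nearest other row by squared Euclidean distance.
        neighbour = min(
            (r for r in rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )
        alpha = rng.random()
        synthetic.append([a + alpha * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical target series, treated as the minority class.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rows = embed(target, n_lags=2)
synthetic = interpolate_oversample(rows, n_new=5, rng=random.Random(0))
```

In a setting like the one described, the synthetic rows would be added to the pooled training set so that the series of interest carries more weight when the forecasting model is fitted.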
