Publications - INESC TEC

Publications by Rita Paula Ribeiro

2026

Machine Learning and Knowledge Discovery in Databases. Research Track - European Conference, ECML PKDD 2025, Porto, Portugal, September 15-19, 2025, Proceedings, Part IV

Authors
Ribeiro, RP; Pfahringer, B; Japkowicz, N; Larrañaga, P; Jorge, AM; Soares, C; Abreu, PH; Gama, J;

Publication
ECML/PKDD (4)


2026

Machine Learning and Knowledge Discovery in Databases. Research Track - European Conference, ECML PKDD 2025, Porto, Portugal, September 15-19, 2025, Proceedings, Part I

Authors
Ribeiro, RP; Pfahringer, B; Japkowicz, N; Larrañaga, P; Jorge, AM; Soares, C; Abreu, PH; Gama, J;

Publication
ECML/PKDD (1)


2026

CARTGen-IR: Synthetic Tabular Data Generation for Imbalanced Regression

Authors
Pinheiro, AP; Ribeiro, RP;

Publication
IDA

Abstract
Handling imbalanced target distributions in regression poses a persistent challenge, as the underrepresentation of relevant target values can significantly hinder model performance. Existing data-level solutions often adapt classification-oriented techniques, introducing arbitrary thresholds over the continuous target and leading to artificial and potentially misleading problem formulations. Deep generative models offer flexible sample synthesis but are computationally intensive and difficult to interpret. We propose a CART-based synthetic sampling method specifically designed for imbalanced regression on tabular data. The method integrates relevance- and density-guided sampling to address sparse target regions without thresholding, and employs a feature-driven tree structure to generate realistic tabular samples across heterogeneous features and non-linear interactions. Experiments on benchmark datasets for extreme-value prediction show that the proposed approach is competitive with state-of-the-art resampling and generative methods while offering faster execution and greater transparency. These results highlight its potential as a scalable and interpretable data-level strategy for improving regression models in imbalanced domains. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
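
The idea of guiding sampling toward sparse target regions without a threshold can be illustrated with a small sketch. This is not the CARTGen-IR procedure itself: it assumes a plain Gaussian kernel density estimate over the target and draws seed examples with probability inversely proportional to that density. The function names, the `bandwidth` parameter, and the inverse-density weighting are all illustrative assumptions, and the CART-based generation step is not covered.

```python
import math
import random


def kde_density(values, bandwidth):
    """Gaussian kernel density estimate evaluated at each observed value."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((v - u) / bandwidth) ** 2) for u in values) / norm
        for v in values
    ]


def rarity_weights(targets, bandwidth=1.0):
    """Sampling weights that grow as the target density shrinks (assumption:
    inverse density; no threshold splits the target into rare/normal)."""
    inverse = [1.0 / d for d in kde_density(targets, bandwidth)]
    total = sum(inverse)
    return [w / total for w in inverse]


def sample_seed_indices(targets, k, bandwidth=1.0, seed=0):
    """Draw k seed indices, favouring examples in sparse target regions."""
    rng = random.Random(seed)
    weights = rarity_weights(targets, bandwidth)
    return rng.choices(range(len(targets)), weights=weights, k=k)
```

With targets such as `[0.0, 0.1, 0.2, 0.1, 0.0, 10.0]`, the isolated value `10.0` receives the largest weight, so it is the most likely seed for new synthetic examples.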


2025

Efficient Instance Selection in Tree-Based Models for Data Streams Classification

Authors
Paim, AM; Gama, J; Veloso, B; Enembreck, F; Ribeiro, RP;

Publication
40TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING

Abstract
Learning from continuous data streams is a relevant area within machine learning, focusing on creating and updating predictive models in real time as new data becomes available for training and prediction. Among the most widely used methods for this type of task, Hoeffding Trees are highly valued for their simplicity and robustness across a variety of applications and are considered the primary choice for generating decision trees in data stream contexts. However, Hoeffding Trees tend to expand continuously as new data is incorporated, resulting in increased processing time and memory consumption, often without significant gains in accuracy. In this study, we propose an instance selection scheme that combines different strategies to regularize Hoeffding Trees and their variants, mitigating excessive growth without compromising model accuracy. The method selects misclassified instances and a fraction of correctly classified instances during the training phase. In an extensive experimental evaluation on both real and synthetic data stream datasets, the instance selection scheme demonstrates superior predictive performance compared to the original models (without selection) while using a reduced subset of examples. Additionally, the method achieves relevant improvements in processing time, model complexity, and memory consumption, highlighting the effectiveness of the proposed instance selection scheme.
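
The selection rule described in the abstract (train on every misclassified instance and on a fraction of the correctly classified ones) can be sketched in a test-then-train loop. This is a minimal illustration, not the authors' implementation: `keep_fraction`, `should_train_on`, `run_stream`, and the `MajorityClassLearner` stand-in are hypothetical names, and the stand-in is a trivial learner rather than a Hoeffding Tree.

```python
import random


def should_train_on(y_true, y_pred, keep_fraction, rng):
    """Decide whether an instance is passed to the learner."""
    if y_pred != y_true:
        return True  # misclassified instances are always kept
    # Correctly classified instances are kept with probability keep_fraction.
    return rng.random() < keep_fraction


class MajorityClassLearner:
    """Trivial incremental learner used as a stand-in (NOT a Hoeffding Tree)."""

    def __init__(self):
        self.counts = {}

    def predict_one(self, x):
        if not self.counts:
            return None  # no prediction before the first training example
        return max(self.counts, key=self.counts.get)

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def run_stream(stream, learner, keep_fraction=0.1, seed=0):
    """Test-then-train over (x, y) pairs; returns (used, correct, total)."""
    rng = random.Random(seed)
    used = correct = total = 0
    for x, y in stream:
        y_pred = learner.predict_one(x)  # predict before seeing the label
        total += 1
        correct += int(y_pred == y)
        if should_train_on(y, y_pred, keep_fraction, rng):
            learner.learn_one(x, y)
            used += 1
    return used, correct, total
```

Any incremental classifier exposing `predict_one` and `learn_one` could replace the stand-in. The ratio `used / total` shows how much of the stream actually reaches the model.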


2025

Network-Based Anomaly Detection in Waste Transportation Data

Authors
Shaji, N; Tabassum, S; Ribeiro, RP; Gama, J; Santana, P; Garcia, A;

Publication
COMPLEX NETWORKS & THEIR APPLICATIONS XIII, COMPLEX NETWORKS 2024, VOL 1

Abstract
Waste transport management is a critical sector where maintaining accurate records and preventing fraudulent or illegal activities is essential for regulatory compliance, environmental protection, and public safety. However, monitoring and analyzing large-scale waste transport records to identify suspicious patterns or anomalies is a complex task. These records often involve multiple entities and exhibit variability in waste flows between them. Traditional anomaly detection methods that rely solely on individual transaction data may struggle to capture the deeper, network-level anomalies that emerge from the interactions between entities. To address this complexity, we propose a hybrid approach that integrates network-based measures with machine learning techniques for anomaly detection in waste transport data. Our method leverages advanced graph analysis techniques, such as sub-graph detection, community structure analysis, and centrality measures, to extract meaningful features that describe the network's topology. We also introduce novel metrics for edge weight disparities. Further, advanced machine learning techniques, including clustering, neural network, density-based, and ensemble methods, are applied to these structural features to enhance and refine the identification of anomalous behaviors.
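
The general pattern of deriving network-level features from transaction records and then scoring entities can be sketched as follows. Several assumptions apply: the `(sender, receiver, weight)` record schema is hypothetical, the disparity shown is the standard sum of squared weight shares per node rather than the paper's novel metrics, and a simple z-score stands in for the clustering, neural network, density-based, and ensemble detectors the abstract mentions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def node_features(shipments):
    """Aggregate (sender, receiver, weight) records into per-sender features."""
    out_w = defaultdict(lambda: defaultdict(float))
    for sender, receiver, weight in shipments:
        out_w[sender][receiver] += weight  # merge repeated shipments on an edge

    features = {}
    for sender, targets in out_w.items():
        strength = sum(targets.values())
        # Disparity: 1.0 when all weight goes to one receiver,
        # 1/k when it is spread evenly over k receivers.
        disparity = (
            sum((w / strength) ** 2 for w in targets.values()) if strength else 0.0
        )
        features[sender] = {
            "out_degree": len(targets),
            "out_strength": strength,
            "disparity": disparity,
        }
    return features


def flag_outliers(features, key, z_threshold=2.0):
    """Return nodes whose feature value lies more than z_threshold
    population standard deviations from the mean (a stand-in detector)."""
    values = [f[key] for f in features.values()]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return sorted(
        node for node, f in features.items() if abs(f[key] - mu) / sigma > z_threshold
    )
```

In practice the feature dictionary would be passed to a proper anomaly detector. The z-score only shows where the structural features enter the pipeline.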


2025

Screening Urban Soil Contamination in Rome: Insights from XRF and Multivariate Analysis

Authors
Chandramohan, MS; da Silva, IM; Ribeiro, RP; Jorge, A; da Silva, JE;

Publication
ENVIRONMENTS

Abstract
This study screens the spatial distribution and elemental composition of soils in Rome (Italy) using X-ray fluorescence analysis. Fifty-nine soil samples were collected from various locations within the urban areas of the Rome municipality and analyzed for 19 elements. Multivariate statistical techniques, including nonlinear mapping, principal component analysis, and hierarchical cluster analysis, were employed to identify clusters of similar soil samples and their spatial distribution, and to extract environmental quality information. The soil sample clusters reflect the influence of natural geological processes and anthropogenic activities on soil contamination patterns. Spatial clustering using the k-means algorithm further identified six distinct clusters, each with specific geographical distributions and elemental characteristics. The findings underscore the importance of targeted soil assessments to ensure the sustainable use of land resources in urban areas.
