Publications - INESC TEC

Publications

Publications by João Gama

2026

DFDT: Dynamic Fast Decision Tree for IoT Data Stream Mining on Edge Devices

Authors
Lourenço, A; Rodrigo, J; Gama, J; Marreiros, G;

Publication
AAAI

Abstract
The Internet of Things generates massive data streams, with edge computing emerging as a key enabler for online IoT applications and 5G networks. Edge solutions facilitate real-time machine learning inference, but also require continuous adaptation to concept drifts. While extensions of the Very Fast Decision Tree (VFDT) remain state-of-the-art for tabular stream mining, their unregulated growth limits efficiency, particularly in ensemble settings where post-pruning at the individual tree level is seldom applied. This paper presents DFDT, a novel memory-constrained algorithm for online learning. DFDT employs activity-aware pre-pruning, dynamically adjusting splitting criteria based on leaf node activity: low-activity nodes are deactivated to conserve resources, moderately active nodes split under stricter conditions, and highly active nodes leverage a skipping mechanism for accelerated growth. Additionally, adaptive grace periods and tie thresholds allow DFDT to modulate splitting decisions based on observed data variability, enhancing the accuracy–memory–runtime trade-off while minimizing the need for hyperparameter tuning. An ablation study reveals three DFDT variants suited to different resource profiles. Fully compatible with existing ensemble frameworks, DFDT provides a drop-in alternative to standard VFDT-based learners. © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
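For readers unfamiliar with the splitting criteria and tie thresholds mentioned in this abstract, the following is a minimal sketch of the standard VFDT split test based on the Hoeffding bound. It is not DFDT's activity-aware rule, which the abstract does not specify; the function names and the default values for `delta` and `tie_threshold` are assumptions chosen for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound for the mean of n observations with the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, n: int,
                 value_range: float = 1.0, delta: float = 1e-7,
                 tie_threshold: float = 0.05) -> bool:
    """Standard VFDT-style check (illustrative, not the DFDT criterion).

    Split when the best attribute beats the runner-up by more than the
    bound, or when the bound is so small that the two are treated as tied.
    """
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


# With few observations the bound is wide, so a small gain gap is not enough.
print(should_split(0.30, 0.25, n=200))
# With many observations the bound drops below the tie threshold.
print(should_split(0.30, 0.25, n=5000))
```

A leaf typically runs this check only once per grace period, that is, after a fixed number of new instances have arrived; the abstract describes DFDT as adapting both that period and the tie threshold.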


2026

Interpretable rules for online failure prediction: a case study on Metro do Porto datasets

Authors
Jakobs, M; Veloso, B; Gama, J;

Publication
INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS

Abstract
Predictive maintenance applications have increasingly been approached with deep learning techniques in recent years due to their high predictive performance. However, as in other real-world application scenarios, the need for explainability is often stated but not sufficiently addressed, which can limit adoption in practice. In this study, we focus on predicting failures of trains operating in Porto, Portugal. While recent works have found high-performing deep neural network architectures that feature a parallel explainability pipeline, we find that the generated explanations can be hard to comprehend in practice due to their low support over the failure range. In this work, we propose a novel online rule-learning approach that is able to generate simple rules that cover the entirety of the detected failures. We evaluate our method against AMRules, a state-of-the-art online rule-learning approach, on two datasets gathered from trains operated by Metro do Porto. Our experiments show that our approach consistently generates rules with very high support that are simultaneously short and interpretable.


2025

One-Class Learning for Data Stream Through Graph Neural Networks

Authors
Gôlo, MPS; Gama, J; Marcacini, RM;

Publication
INTELLIGENT SYSTEMS, BRACIS 2024, PT IV

Abstract
In many data stream applications, there is a normal concept, and the objective is to identify normal and abnormal concepts by training only with normal concept instances. This scenario is known in the literature as one-class learning (OCL) for data streams. In this OCL scenario for data streams, we highlight two main gaps: (i) lack of methods based on graph neural networks (GNNs) and (ii) lack of interpretable methods. We introduce OPENCAST (One-class graPh autoENCoder for dAta STream), a new method for data streams based on OCL and GNNs. Our method learns representations while encapsulating the instances of interest through a hypersphere. OPENCAST learns low-dimensional representations to generate interpretability in the representation learning process. OPENCAST achieved state-of-the-art results for data streams in the OCL scenario, outperforming seven other methods. Furthermore, OPENCAST learns low-dimensional representations, generating interpretability in the representation learning process and results.


2025

Evaluating Short Text Stream Clustering on Large E-commerce Datasets

Authors
Andrade, C; Ribeiro, RP; Gama, J;

Publication
INTELLIGENT SYSTEMS, BRACIS 2024, PT III

Abstract
Latent Dirichlet Allocation (LDA) is a fundamental method for clustering short text streams. However, when applied to large datasets, it often faces significant challenges, and its performance is typically evaluated on domain-specific datasets such as news and tweets. This study aims to fill this gap by evaluating the effectiveness of short text clustering methods on a large and diverse e-commerce dataset. We specifically investigate how well these clustering algorithms adapt to the complex dynamics and larger scale of e-commerce text streams, which differ from their usual application domains. Our analysis focuses on the impact of high homogeneity scores on the reported Normalized Mutual Information (NMI) values. We particularly examine whether these scores are inflated due to the prevalence of single-element clusters. To address potential biases in clustering evaluation, we propose using the Akaike Information Criterion (AIC) as an alternative metric to reduce the formation of single-element clusters and provide a more balanced measure of clustering performance. We present new insights for applying short text clustering methodologies in real-world situations, especially in sectors like e-commerce, where text data volumes and dynamics present unique challenges.
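The concern about single-element clusters can be illustrated with a small sketch. It computes homogeneity and NMI from their entropy-based definitions on made-up labels; the arithmetic-mean normalisation of NMI and all names here are assumptions, and the abstract does not state which variant the study reports.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(classes, clusters):
    """Mutual information between true classes and cluster assignments."""
    n = len(classes)
    class_counts, cluster_counts = Counter(classes), Counter(clusters)
    joint = Counter(zip(classes, clusters))
    return sum((nij / n) * math.log(nij * n / (class_counts[c] * cluster_counts[k]))
               for (c, k), nij in joint.items())


def homogeneity(classes, clusters):
    """1 - H(C|K) / H(C): equals 1 when every cluster holds a single class."""
    h_c = entropy(classes)
    return 1.0 if h_c == 0 else mutual_information(classes, clusters) / h_c


def nmi(classes, clusters):
    """NMI with arithmetic-mean normalisation (an assumed variant)."""
    denom = entropy(classes) + entropy(clusters)
    return 1.0 if denom == 0 else 2 * mutual_information(classes, clusters) / denom


# Hypothetical data: two true topics, and every text placed in its own cluster.
true_topics = [0, 0, 0, 0, 1, 1, 1, 1]
singleton_clusters = list(range(len(true_topics)))

print(homogeneity(true_topics, singleton_clusters))
print(nmi(true_topics, singleton_clusters))
```

On this toy input the singleton clustering is perfectly homogeneous even though it groups nothing, and the NMI stays well above zero, which is the kind of inflation the abstract sets out to examine.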


2025

Anomaly Detection in Pet Behavioural Data

Authors
Silva, I; Ribeiro, RP; Gama, J;

Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II

Abstract
Pet owners are increasingly becoming conscious of their pet's necessities and are paying more attention to their overall wellness. The well-being of their pets is intricately linked to their own emotional and physical well-being. Some veterinary system solutions are emerging to provide proactive healthcare options for pets. One such solution offers the continuous monitoring of a pet's activity through accelerometer tracking devices. Based on data collected by this application, in this paper, we study different time aggregations and three unsupervised machine learning techniques to identify anomalies in pet behaviour data. Specifically, we apply three algorithms, Isolation Forest, Local Outlier Factor, and K-Nearest Neighbour, with various thresholds to differentiate between normal and abnormal events. Experiments conducted on ten pets (five cats and five dogs) show that the most effective approach is to use daily data divided into periods. Moreover, the Local Outlier Factor is the best algorithm for detecting anomalies when prioritizing the identification of true positives. However, it also produces a high false positive ratio.
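As a rough illustration of threshold-based anomaly detection of the kind named in this abstract, the sketch below scores each observation by its distance to its k-th nearest neighbour and flags scores above a threshold. The activity values, `k`, and the threshold are hypothetical; the paper's actual features, aggregation periods, and settings are not given here.

```python
def knn_scores(values, k=2):
    """Anomaly score of each value: distance to its k-th nearest neighbour."""
    scores = []
    for i, v in enumerate(values):
        # Distances to every other observation, smallest first.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_anomalies(values, k=2, threshold=3.0):
    """Mark observations whose k-NN distance exceeds the threshold."""
    return [score > threshold for score in knn_scores(values, k)]


# Hypothetical aggregated activity levels for one pet, one value per period.
activity = [10.0, 11.0, 9.5, 10.5, 25.0, 10.2]
print(flag_anomalies(activity))
```

Lowering the threshold catches more true anomalies at the cost of more false alarms, the same trade-off the abstract reports for the Local Outlier Factor.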


2025

Data Science for Fighting Environmental Crime

Authors
Barbosa, M; Ribeiro, C; Gomes, F; Ribeiro, RP; Gama, J;

Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II

Abstract
The rise of environmental crimes has become a major concern globally, as they cause significant damage to ecosystems and public health and result in economic losses. The availability of vast sensor data provides an opportunity to analyze environmental data proactively. This helps to detect irregularities and uncover potential criminal activities. This paper highlights the critical role played by machine learning (ML) and remote sensing technologies in the continuously evolving scenarios of environmental crime. By examining some case studies on detecting illegal fishing, illegal oil spills, illegal landfills, and illegal logging, we delve into the practical implementation of data-driven approaches for environmental crime detection. Our goal with this study is to provide an overview of the existing research in this area and foster the use of ML and data science techniques to enhance environmental crime detection.
