Publications

Publications by João Gama

2026

In-context Learning of Evolving Data Streams with Tabular Foundational Models

Authors
Lourenço, A; Gama, J; Xing, EP; Marreiros, G;

Publication
KDD (1)

Abstract
State-of-the-art data stream mining has long drawn from ensembles of the Very Fast Decision Tree, a seminal algorithm honored with the 2015 KDD Test-of-Time Award. However, the emergence of large tabular models, i.e., transformers designed for structured numerical data, marks a significant paradigm shift. These models move beyond traditional weight updates, instead employing in-context learning through prompt tuning. By using on-the-fly sketches to summarize unbounded streaming data, one can feed this information into a pre-trained model for efficient processing. This work bridges advancements from both areas, highlighting how transformers' implicit meta-learning abilities, pre-training on drifting natural data, and reliance on context optimization directly address the core challenges of adaptive learning in dynamic environments. Exploring real-time model adaptation, this research demonstrates that TabPFN, coupled with a simple sliding memory strategy, consistently outperforms ensembles of Hoeffding trees, such as Adaptive Random Forest and Streaming Random Patches, across all non-stationary benchmarks. © 2026 Owner/Author.


2026

DFDT: Dynamic Fast Decision Tree for IoT Data Stream Mining on Edge Devices

Authors
Lourenço, A; Rodrigo, J; Gama, J; Marreiros, G;

Publication
AAAI

Abstract
The Internet of Things generates massive data streams, with edge computing emerging as a key enabler for online IoT applications and 5G networks. Edge solutions facilitate real-time machine learning inference, but also require continuous adaptation to concept drifts. While extensions of the Very Fast Decision Tree (VFDT) remain state-of-the-art for tabular stream mining, their unregulated growth limits efficiency, particularly in ensemble settings where post-pruning at the individual tree level is seldom applied. This paper presents DFDT, a novel memory-constrained algorithm for online learning. DFDT employs activity-aware pre-pruning, dynamically adjusting splitting criteria based on leaf node activity: low-activity nodes are deactivated to conserve resources, moderately active nodes split under stricter conditions, and highly active nodes leverage a skipping mechanism for accelerated growth. Additionally, adaptive grace periods and tie thresholds allow DFDT to modulate splitting decisions based on observed data variability, enhancing the accuracy–memory–runtime trade-off while minimizing the need for hyperparameter tuning. An ablation study reveals three DFDT variants suited to different resource profiles. Fully compatible with existing ensemble frameworks, DFDT provides a drop-in alternative to standard VFDT-based learners. © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.


2026

Interpretable rules for online failure prediction: a case study on Metro do Porto datasets

Authors
Jakobs, M; Veloso, B; Gama, J;

Publication
INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS

Abstract
Predictive maintenance applications have increasingly been approached with deep learning techniques in recent years due to their high predictive performance. However, as in other real-world application scenarios, the need for explainability is often stated but not sufficiently addressed, which can limit adoption in practice. In this study, we will focus on predicting failures of trains operating in Porto, Portugal. While recent works have found high-performing deep neural network architectures that feature a parallel explainability pipeline, we find that the generated explanations can be hard to comprehend in practice due to their low support over the failure range. In this work, we propose a novel online rule-learning approach that is able to generate simple rules that cover the entirety of the detected failures. We evaluate our method against AMRules, a state-of-the-art online rule-learning approach, on two datasets gathered from trains operated by Metro do Porto. Our experiments show that our approach consistently generates rules with very high support that are simultaneously short and interpretable.


2025

One-Class Learning for Data Stream Through Graph Neural Networks

Authors
Gôlo, MPS; Gama, J; Marcacini, RM;

Publication
INTELLIGENT SYSTEMS, BRACIS 2024, PT IV

Abstract
In many data stream applications, there is a normal concept, and the objective is to identify normal and abnormal concepts by training only with normal concept instances. This scenario is known in the literature as one-class learning (OCL) for data streams. In this OCL scenario for data streams, we highlight two main gaps: (i) lack of methods based on graph neural networks (GNNs) and (ii) lack of interpretable methods. We introduce OPENCAST (One-class graPh autoENCoder for dAta STream), a new method for data streams based on OCL and GNNs. Our method learns representations while encapsulating the instances of interest through a hypersphere. OPENCAST learns low-dimensional representations to generate interpretability in the representation learning process. OPENCAST achieved state-of-the-art results for data streams in the OCL scenario, outperforming seven other methods. Furthermore, OPENCAST learns low-dimensional representations, generating interpretability in the representation learning process and results.


2025

Evaluating Short Text Stream Clustering on Large E-commerce Datasets

Authors
Andrade, C; Ribeiro, RP; Gama, J;

Publication
INTELLIGENT SYSTEMS, BRACIS 2024, PT III

Abstract
Latent Dirichlet Allocation (LDA) is a fundamental method for clustering short text streams. However, when applied to large datasets, it often faces significant challenges, and its performance is typically evaluated in domain-specific datasets such as news and tweets. This study aims to fill this gap by evaluating the effectiveness of short text clustering methods in a large and diverse e-commerce dataset. We specifically investigate how well these clustering algorithms adapt to the complex dynamics and larger scale of e-commerce text streams, which differ from their usual application domains. Our analysis focuses on the impact of high homogeneity scores on the reported Normalized Mutual Information (NMI) values. We particularly examine whether these scores are inflated due to the prevalence of single-element clusters. To address potential biases in clustering evaluation, we propose using the Akaike Information Criterion (AIC) as an alternative metric to reduce the formation of single-element clusters and provide a more balanced measure of clustering performance. We present new insights for applying short text clustering methodologies in real-world situations, especially in sectors like e-commerce, where text data volumes and dynamics present unique challenges.


2025

Anomaly Detection in Pet Behavioural Data

Authors
Silva, I; Ribeiro, RP; Gama, J;

Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II

Abstract
Pet owners are increasingly becoming conscious of their pets' needs and are paying more attention to their overall wellness. The well-being of their pets is intricately linked to their own emotional and physical well-being. Some veterinary system solutions are emerging to provide proactive healthcare options for pets. One such solution offers the continuous monitoring of a pet's activity through accelerometer tracking devices. Based on data collected by this application, in this paper, we study different time aggregations and three unsupervised machine learning techniques to identify anomalies in pet behaviour data. Specifically, we apply three algorithms, Isolation Forest, Local Outlier Factor, and K-Nearest Neighbour, with various thresholds to differentiate between normal and abnormal events. Experiments conducted on ten pets (five cats and five dogs) show that the most effective approach is to use daily data divided into periods. Moreover, the Local Outlier Factor is the best algorithm for detecting anomalies when prioritizing the identification of true positives. However, it also produces a high false positive ratio.
