Publications - INESC TEC

Publications

Publications by LIAAD

2025

Fine-Tuning Transformer-Based LLMs in Hierarchical Text Classification

Authors
Santos, J; Silva, N; Ferreira, C; Gama, J;

Publication
DISCOVERY SCIENCE, DS 2025

Abstract
Hierarchical document classification is essential for structuring large-scale textual corpora in domains such as digital libraries and academic repositories. While recent advances in large language models (LLMs) have opened new possibilities for text classification, their applicability to hierarchical settings under real-world constraints remains underexplored. This study investigates both generative and discriminative transformer-based models, evaluating their effectiveness across multiple inference strategies: zero-shot baseline, local fine-tuning, and a global approach using category-specific models. Experiments on two real-world hierarchical datasets provide a comprehensive comparison of classification accuracy, F1-macro scores, and inference times. The results highlight that, although generative LLMs can deliver competitive (yet variable) performance at higher levels of the hierarchy, their high inference costs hinder their use in time-sensitive applications. In contrast, fine-tuned discriminative models, particularly BERT-based architectures, consistently offer a more favorable trade-off between performance and efficiency.


2025

RMIDDM: an unsupervised and interpretable concept drift detection method for data streams

Authors
Neto, R; Alencar, B; Gomes, HM; Bifet, A; Gama, J; Cassales, G; Rios, R;

Publication
DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Traditional machine learning techniques assume that data is drawn from a stationary source. This assumption is challenged in data stream settings, which present continuous and potentially infinite sequences whose distribution is prone to change over time. In these settings, detecting changes (a.k.a. concept drifts) is necessary to keep learning models up to date. Although state-of-the-art detection methods were designed to monitor the loss of predictive models, such monitoring falls short in many real-world scenarios where the true labels are not readily available. Therefore, unsupervised concept drift detection methods, such as the one addressed in this paper, are receiving increasing attention. In this work, we present an unsupervised and interpretable method based on Radial Basis Function Networks (RBFN) and Markov Chains (MC), referred to as RMIDDM (Radial Markov Interpretable Drift Detection Method). In our method, the RBFN performs, in its intermediate layer, an activation process that implicitly produces groups of observations collected over time. Simultaneously, the MC models the transitions between groups to support the detection of concept drifts, which occur when the active group changes and its probability exceeds a given threshold. A set of experiments with synthetic datasets and comparisons with state-of-the-art algorithms demonstrated that the proposed method can detect drifts at runtime in an efficient, interpretable, and label-independent way, presenting competitive results and behavior. Additionally, to show its applicability in a real-world scenario, we analyzed new COVID-19 cases, deaths, and vaccinations to identify new waves as concept drifts and to generate Markov models that help explain their interaction.


2025

Effect of AI on Innovation Capacity in the context of Industry 5.0: Findings from a Qualitative study

Authors
Bécue, A; Gama, J; Brito, PQ;

Publication
Strategic Business Research


2025

A Systematic Literature Review on Multi-label Data Stream Classification

Authors
Oliveira, HF; de Faria, ER; Gama, J; Khan, L; Cerri, R;

Publication
CoRR


2025

Salvador Urban Network Transportation (SUNT): A Landmark Spatiotemporal Dataset for Public Transportation

Authors
Ferreira, MV; Souza, M; Rios, TN; Fernandes, IFC; Nery, J; Gama, J; Bifet, A; Rios, RA;

Publication
SCIENTIFIC DATA

Abstract
Efficient public transportation management is essential for the development of large urban centers, providing several benefits such as comprehensive coverage of population mobility, reduction of transport costs, better control of traffic congestion, and a significant reduction of environmental impact by limiting gas emissions and pollution. Realizing these benefits requires a deep understanding of population and transit patterns and the adoption of approaches that model multiple relations and characteristics efficiently. This work addresses these challenges by providing a novel dataset that includes various public transportation components from three different systems: regular buses, subway, and BRT (Bus Rapid Transit). Our dataset comprises daily information from about 700,000 passengers in Salvador, one of Brazil's largest cities, and local public transportation data with approximately 2,000 vehicles operating across nearly 400 lines, connecting almost 3,000 stops and stations. With data collected from March 2024 to March 2025 at intervals of less than one minute, SUNT stands as one of the largest, most comprehensive, and openly available urban datasets in the literature.


2025

Data Science: Foundations and Applications - 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2025, Sydney, Australia, June 10-13, 2025, Proceedings, Part VII

Authors
Wu, X; Spiliopoulou, M; Wang, C; Kumar, V; Cao, L; Zhou, X; Pang, G; Gama, J;

Publication
PAKDD (7)
