Publications by LIAAD

2026

Turning web data into official statistics: Classifying Portuguese retail products with NLP models

Authors
Machado, JDU; Veloso, B;

Publication
STATISTICAL JOURNAL OF THE IAOS

Abstract
The growing availability of online data creates new opportunities to improve the timeliness and detail of official statistics, particularly in domains such as price monitoring and inflation measurement. However, leveraging web-scraped data for official use requires alignment with standardized classification frameworks such as the European Classification of Individual Consumption According to Purpose (ECOICOP). We train two natural-language models, a lightweight convolutional neural network (CNN) and a fine-tuned BERTimbau transformer, to classify Portuguese food and beverage items into ECOICOP categories. Using 100,000 product titles scraped from six national supermarket sites and labeled via a human-in-the-loop workflow, the CNN reaches a macro-F1 of 92.19 % with minimal computing cost, while the transformer attains 94.00 %, the first such result for Portuguese. Both models are published on Hugging Face, enabling reproducible inference at scale while the source data remain confidential. The study delivers the first open-source Portuguese ECOICOP classifiers for food and beverage products, a replicable low-resource labeling workflow, and a benchmark of accuracy-speed trade-offs to guide researchers in similar tasks.
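The abstract reports results as macro-F1, the unweighted mean of per-class F1 scores, so rare categories count as much as frequent ones. A minimal sketch of that metric in plain Python follows; the function name and the toy labels are hypothetical illustrations, not ECOICOP codes or the authors' evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical category labels.
y_true = ["bread", "bread", "milk", "milk"]
y_pred = ["bread", "milk", "milk", "milk"]
score = macro_f1(y_true, y_pred)
```

Here the "bread" class scores 2/3 and the "milk" class scores 4/5, so the macro average is 11/15.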


2026

A Parametric Information-gain to Improve Online Tree-based Machine Learning Models

Authors
Costa, VV; Costa, D; Veloso, B; Rocha, EM;

Publication

Abstract
Decision trees are a cornerstone of interpretable machine learning and are widely used for their simplicity and effectiveness in classification tasks. To address the growing need for models that can operate on continuous, unbounded data, decision trees have been reinvented for the data stream setting, where they must learn incrementally under constraints such as limited memory, evolving distributions, and delayed supervision. A critical component of these tree-based models, particularly those based on Hoeffding trees, is the split criterion, which determines how the input space is partitioned. This study introduces a new split criterion for stream-based Hoeffding trees, based on a unified five-parameter entropic formulation that generalizes several well-known measures, including Shannon, Gini, Tsallis, and Rényi entropies. While such formulations have been explored in batch learning, they have not previously been applied to streaming scenarios. By incorporating this criterion into a variety of established streaming classifiers and evaluating performance on standard benchmark datasets, we demonstrate consistent and statistically significant improvements over existing methods, including those implemented in the River library. Notably, we report gains of up to 40% in immediate evaluation metrics, along with consistent wins and some draws on the prequential Macro-F1, with no observed losses against baseline criteria. The generality of the approach introduces additional computational overhead but also enables greater expressiveness and adaptability in handling uncertainty and nonstationary data. This work advances the integration of information-theoretic principles into online learning and highlights the importance of efficient hyperparameter tuning and adaptive entropy selection in streaming environments.
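The abstract does not state the five-parameter formulation, so the sketch below does not reproduce it. It only illustrates the general idea of a parametric entropy used as a split criterion, with the standard one-parameter Tsallis and Rényi families: Tsallis with q = 2 equals the Gini index, and both families tend to Shannon entropy as their parameter approaches 1. All function names are hypothetical, and the gain is computed in batch form on class counts rather than incrementally as a Hoeffding tree would.

```python
import math


def shannon(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def gini(p):
    """Gini impurity of a probability vector."""
    return 1.0 - sum(x * x for x in p)


def tsallis(p, q):
    """Tsallis entropy; q = 2 gives Gini, q -> 1 gives Shannon."""
    if abs(q - 1.0) < 1e-12:
        return shannon(p)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)


def renyi(p, alpha):
    """Renyi entropy; alpha -> 1 gives Shannon."""
    if abs(alpha - 1.0) < 1e-12:
        return shannon(p)
    return math.log(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)


def information_gain(parent_counts, children_counts, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    def to_probs(counts):
        total = sum(counts)
        return [c / total for c in counts] if total else []

    n = sum(parent_counts)
    weighted = sum(
        sum(child) / n * impurity(to_probs(child)) for child in children_counts
    )
    return impurity(to_probs(parent_counts)) - weighted


# A perfectly separating split of 5 + 5 examples, scored with Tsallis (q = 2).
gain = information_gain([5, 5], [[5, 0], [0, 5]], lambda p: tsallis(p, 2.0))
```

Swapping the `impurity` argument changes the criterion without touching the split search, which is what makes the entropy parameters tunable like any other hyperparameter.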


2026

Deep neural networks in medical microbiology for bacterial colonies classification

Authors
Pereira, JD; Veloso, B; Gama, J;

Publication
SCIENTIFIC REPORTS

Abstract
While automation has transformed many areas inside clinical laboratories, microbiology still relies heavily on manual tasks, particularly the culture of samples on agar plates and their subsequent manual review for microorganism identification and antibiotic susceptibility profiling. Bacterial colony detection and classification require trained professionals, making the process time-consuming and prone to human error. Developing deep learning models to automate these tasks could improve microbiology workflows and accelerate clinical decision-making. In this study we trained and evaluated five object detection architectures (Faster R-CNN and RetinaNet with ResNet-50 and ResNet-101 backbones, and YOLOv8) on the Annotated Germs for Automated Recognition (AGAR) dataset for bacterial colony classification. Transfer learning, cross-subset generalization, and Weighted Box Fusion (WBF) ensemble methods were applied to enhance and characterize performance. Additionally, we created and publicly released a curated dataset of 165 agar plate images containing colonies of S. aureus, P. aeruginosa, and E. coli cultured across four distinct culture media. YOLOv8m achieved a mean Average Precision (mAP) of 69.0% on the AGAR dataset, outperforming the best Detectron2 model (Faster R-CNN ResNet-101, 63.1%) by 5.9 percentage points. A four-model WBF ensemble combining both architectures reached 70.5% mAP (95% CI: 68.4-71.7). Cross-subset evaluation showed that a single model trained on the full dataset generalizes well to individual imaging conditions, making subset-specific fine-tuning largely unnecessary. On the curated dataset, a mixed ensemble reached 58.7% mAP (95% CI: 57.1-63.7). These results demonstrate that architecture choice and training data diversity are the primary drivers of performance for colony detection on agar plates.


2026

A machine learning analysis to identify biomarkers on Holter data of white matter lesions in Fabry disease patients

Authors
Araújo, B; Moura, AR; Veloso, B; Azevedo, O; Gago, MF; Erlhagen, W; Bicho, E; Ferreira, F;

Publication
INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS

Abstract
Fabry disease (FD) is a rare genetic disorder associated with cardiac abnormalities and often overlooked brain white matter lesions (WMLs). Despite the importance of early WML detection, diagnosis is frequently delayed. This study aims to identify electrocardiographic biomarkers linked to WMLs in middle-aged FD patients using machine learning, assessing their potential as non-invasive diagnostic tools. This retrospective study analyzed electrocardiographic data from FD patients aged 40-59. A feature selection process based on variance inflation factor analysis identified nine relevant features, including heart rate variability and QT interval parameters. Machine learning classifiers (logistic regression, support vector machines, random forest, and k-nearest neighbors) were trained and evaluated using accuracy, sensitivity, specificity, and AUC. SHAP (SHapley Additive exPlanations) analysis was used to interpret model predictions. The random forest model achieved the highest accuracy (0.81) using all nine features. A subset consisting of SDANN 5 and QTc Min also performed well (accuracy 0.75) in other models. SHAP analysis highlighted SDANN 5 as a key predictor. Machine learning applied to ECG data shows promise for early WML detection in FD, supporting the integration of computational methods into diagnostics for complex genetic diseases.


2026

Ethical Considerations in the Context of AI-Driven Misinformation Detection

Authors
Ettore Barbagallo; Guillaume Gadek; Géraud Faye; Nina Khairova; Chirag Arora; Dilhan Thilakarathne; Karen Joisten; Sónia Teixeira; Juan M. Durán; Manuel Barrantes;

Publication
Handbook of Human-AI Collaboration

Abstract
Misinformation poses one of the most urgent challenges of our society and raises the question of how to deal with it and manage its rapid spread. To address this problem, a promising approach relies on AI-based misinformation detection. This chapter offers a critical analysis of the ethical implications associated with the design, deployment, and use of misinformation detectors (MDs). Designing and deploying an MD, an AI system that automatically identifies misinformation, is a complex undertaking that requires an interdisciplinary approach, as the challenges faced by MD designers and deployers encompass not only technical aspects, but also linguistic, sociological, political, and especially ethical dimensions. Our analysis is ethics-oriented and follows two main lines of inquiry: (1) Ethics by Design, which focuses on issues related to the design process of an MD, and (2) Ethics of Impact, which addresses the intended and unintended effects of MD deployment and use.


2026

Towards Responsible AI Governance: A Multidimensional Ethical Evaluation Framework

Authors
Teixeira, S; Cortés, A; Thilakarathne, D; Gori, G; Minici, M; Bhuyan, M; Khairova, N; Adewumi, T; Bhuyan, D; O'Keefe, J; Comito, C; Gama, J; Dignum, V;

Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2025, PT I

Abstract
As Artificial Intelligence (AI) systems increasingly permeate sensitive domains such as finance, healthcare, and media, ensuring their ethical deployment has become a central concern for researchers, policymakers, and practitioners. Current auditing tools often assess isolated principles, such as fairness or explainability, lacking a comprehensive view of the ethical risks involved. This paper presents a multidimensional framework for ethical evaluation of AI systems, designed to support responsible AI governance and alignment with the United Nations Sustainable Development Goals (SDGs). The proposed approach enables the simultaneous analysis of key ethical dimensions, including fairness, bias, explainability, robustness, transparency, and legal compliance. We demonstrate the applicability of this tool through one extensive case study: a credit scoring system, considered high-risk under the AI Act. This work contributes to operationalizing responsible AI governance, providing insight for policymakers, regulators, and practitioners to ensure ethical, legally compliant, and socially responsible AI deployment.
