Publications - INESC TEC

Publications

Publications by João Gama

2025

Evaluating Short Text Stream Clustering on Large E-commerce Datasets

Authors
Andrade, C; Ribeiro, RP; Gama, J;

Publication
INTELLIGENT SYSTEMS, BRACIS 2024, PT III

Abstract
Latent Dirichlet Allocation (LDA) is a fundamental method for clustering short text streams. However, when applied to large datasets, it often faces significant challenges, and its performance is typically evaluated in domain-specific datasets such as news and tweets. This study aims to fill this gap by evaluating the effectiveness of short text clustering methods in a large and diverse e-commerce dataset. We specifically investigate how well these clustering algorithms adapt to the complex dynamics and larger scale of e-commerce text streams, which differ from their usual application domains. Our analysis focuses on the impact of high homogeneity scores on the reported Normalized Mutual Information (NMI) values. We particularly examine whether these scores are inflated due to the prevalence of single-element clusters. To address potential biases in clustering evaluation, we propose using the Akaike Information Criterion (AIC) as an alternative metric to reduce the formation of single-element clusters and provide a more balanced measure of clustering performance. We present new insights for applying short text clustering methodologies in real-world situations, especially in sectors like e-commerce, where text data volumes and dynamics present unique challenges.
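The concern about single-element clusters raised in this abstract can be illustrated with a small, self-contained sketch. It is not the authors' evaluation code: the toy labels are invented for illustration, and the NMI normalisation (arithmetic mean of the two entropies) is an assumption, since the abstract does not state which variant is used.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(classes, clusters):
    """H(classes | clusters): class uncertainty left once the cluster is known."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    return -sum((c / n) * math.log(c / cluster_sizes[k])
                for (k, _), c in joint.items())

def homogeneity(classes, clusters):
    """1.0 when every cluster contains members of a single class."""
    h_c = entropy(classes)
    if h_c == 0:
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_c

def nmi(classes, clusters):
    """Mutual information normalised by the mean entropy (assumed variant)."""
    h_c, h_k = entropy(classes), entropy(clusters)
    if h_c == 0 and h_k == 0:
        return 1.0
    mutual_info = h_c - conditional_entropy(classes, clusters)
    return 2.0 * mutual_info / (h_c + h_k)

# Hypothetical toy data: 8 short texts drawn from 2 true topics.
true_topics = ["a"] * 4 + ["b"] * 4
singletons = list(range(8))   # every text placed in its own cluster
one_cluster = [0] * 8         # every text placed in the same cluster

print(homogeneity(true_topics, singletons), nmi(true_topics, singletons))
print(homogeneity(true_topics, one_cluster), nmi(true_topics, one_cluster))
```

In this toy case the all-singletons clustering recovers no grouping at all, yet it reaches perfect homogeneity and a clearly non-zero NMI, while the single-cluster solution scores zero on both.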


2025

Anomaly Detection in Pet Behavioural Data

Authors
Silva, I; Ribeiro, RP; Gama, J;

Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II

Abstract
Pet owners are becoming increasingly conscious of their pets' needs and are paying more attention to their overall wellness. The well-being of their pets is intricately linked to their own emotional and physical well-being. Some veterinary system solutions are emerging to provide proactive healthcare options for pets. One such solution offers continuous monitoring of a pet's activity through accelerometer tracking devices. Based on data collected by this application, in this paper we study different time aggregations and three unsupervised machine learning techniques to identify anomalies in pet behaviour data. Specifically, we apply three algorithms, Isolation Forest, Local Outlier Factor, and K-Nearest Neighbour, with various thresholds to differentiate between normal and abnormal events. Experiments conducted on ten pets (five cats and five dogs) show that the most effective approach is to use daily data divided into periods. Moreover, the Local Outlier Factor is the best algorithm for detecting anomalies when prioritizing the identification of true positives. However, it also produces a high false positive ratio.


2025

Data Science for Fighting Environmental Crime

Authors
Barbosa, M; Ribeiro, C; Gomes, F; Ribeiro, RP; Gama, J;

Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II

Abstract
The rise of environmental crimes has become a major global concern, as they cause significant damage to ecosystems and public health and result in economic losses. The availability of vast sensor data provides an opportunity to analyze environmental data proactively, helping to detect irregularities and uncover potential criminal activities. This paper highlights the critical role played by machine learning (ML) and remote sensing technologies in the continuously evolving scenarios of environmental crime. By examining case studies on detecting illegal fishing, illegal oil spills, illegal landfills, and illegal logging, we delve into the practical implementation of data-driven approaches for environmental crime detection. Our goal with this study is to provide an overview of the existing research in this area and foster the use of ML and data science techniques to enhance environmental crime detection.


2025

Fairness Analysis in Causal Models: An Application to Public Procurement

Authors
Teixeira, S; Nogueira, AR; Gama, J;

Publication
MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II

Abstract
Data-driven decision models based on Artificial Intelligence (AI) have been widely used in the public and private sectors. These models present challenges and are intended to be fair, effective and transparent in public interest areas. Bias, fairness and government transparency are aspects that significantly impact the functioning of a democratic society. They shape the relationship between the government and its citizens, influencing trust, accountability, and the equitable treatment of individuals and groups. Data-driven decision models can be biased at several process stages, contributing to injustices. Our research aims to understand fairness in the use of causal discovery for public procurement. By analysing Portuguese public contracts data, we aim i) to predict the place of execution of public contracts using the PC algorithm with sp-mi, smc-chi(2) and mc-chi(2) conditional independence tests; ii) to analyse and compare the fairness in those scenarios using Predictive Parity Rate, Proportional Parity, Demographic Parity and Accuracy Parity metrics. By addressing fairness concerns, we seek to enhance responsible data-driven decision models. We conclude that, in our case, fairness metrics make an assessment more local than global due to causality pathways. We also observe that the Proportional Parity metric is the one with the lowest variance among all metrics and one with the highest precision, and this reinforces the observation that the Agency category is the one that is furthest apart in terms of the proportion of the groups.


2024

Recent Advances in Learning from Data Streams

Authors
Gama, J;

Publication
Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2024, Volume 1: KDIR, Porto, Portugal, November 17-19, 2024.

Abstract

2024

Next Location Prediction with Time-Evolving Markov Models over Data Streams

Authors
Andrade, T; Gama, J;

Publication
Progress in Artificial Intelligence - 23rd EPIA Conference on Artificial Intelligence, EPIA 2024, Viana do Castelo, Portugal, September 3-6, 2024, Proceedings, Part III

Abstract
Various relevant aspects of our lives relate to the places we visit and our daily activities. The movement of individuals between regular places, such as work, school, or other important personal locations is getting increasing attention due to the pervasiveness of geolocation devices and the amount of data they generate. This paper presents an approach for personal location prediction using a probabilistic model and data mining techniques over mobility data streams. We extract the individuals’ locations from relevant events in a data stream to build and maintain a Markov Chain over the important places. We evaluate the method over 3 real-world datasets. The results show the usefulness of the proposal in comparison with other well-known approaches.
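The idea of maintaining a Markov chain over visited places from a stream can be sketched roughly as follows. This is an illustrative first-order model under stated assumptions, not the method from the paper: the class name `NextPlaceMarkov`, the choice to ignore repeated readings of the same place, and the example stream are all hypothetical, and no time-evolving or forgetting mechanism is included.

```python
from collections import defaultdict

class NextPlaceMarkov:
    """First-order Markov chain over places, updated one visit at a time."""

    def __init__(self):
        # counts[src][dst] = number of observed moves from src to dst
        self.counts = defaultdict(lambda: defaultdict(int))
        self.last_place = None

    def update(self, place):
        """Consume the next visited place from the stream."""
        # Assumption: consecutive readings of the same place are not transitions.
        if self.last_place is not None and place != self.last_place:
            self.counts[self.last_place][place] += 1
        self.last_place = place

    def transition_probs(self, place):
        """Estimated P(next place | current place) from the counts so far."""
        row = self.counts.get(place, {})
        total = sum(row.values())
        return {nxt: c / total for nxt, c in row.items()} if total else {}

    def predict_next(self, place=None):
        """Most probable next place; defaults to the latest observed place."""
        place = self.last_place if place is None else place
        probs = self.transition_probs(place)
        return max(probs, key=probs.get) if probs else None

# Hypothetical stream of already-extracted important places.
model = NextPlaceMarkov()
for visit in ["home", "work", "gym", "home", "work", "home", "work", "gym"]:
    model.update(visit)

print(model.transition_probs("work"))
print(model.predict_next("work"))
print(model.predict_next())
```

Because the counts are updated incrementally, the model never needs to store the raw stream; a variant that adapts to changing habits could, for example, decay old counts, but that is beyond what this sketch shows.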
