Publications

Publications by LIAAD

2025

Data Science for Fighting Environmental Crime

Authors
Barbosa, M; Ribeiro, C; Gomes, F; Ribeiro, RP; Gama, J;

Publication
Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2023, Pt II

Abstract
The rise of environmental crimes has become a major global concern, as they damage ecosystems and public health and cause economic losses. The availability of vast amounts of sensor data provides an opportunity to analyze environmental data proactively, helping to detect irregularities and uncover potential criminal activities. This paper highlights the critical role played by machine learning (ML) and remote sensing technologies in the continuously evolving scenarios of environmental crime. By examining case studies on detecting illegal fishing, illegal oil spills, illegal landfills, and illegal logging, we delve into the practical implementation of data-driven approaches for environmental crime detection. Our goal with this study is to provide an overview of the existing research in this area and foster the use of ML and data science techniques to enhance environmental crime detection.

2025

Histogram approaches for imbalanced data streams regression

Authors
Aminian, E; Ribeiro, RP; Gama, J;

Publication
Machine Learning

Abstract
Imbalanced domains pose a significant challenge in real-world predictive analytics, particularly in the context of regression. While existing research has primarily focused on batch learning from static datasets, limited attention has been given to imbalanced regression in online learning scenarios. To address this gap, in prior work we proposed sampling strategies based on Chebyshev's inequality as the first methodologies designed explicitly for data streams. However, these approaches operated under the restrictive assumption that rare instances reside exclusively at the extremes of the distribution. This study introduces histogram-based sampling strategies to overcome this constraint, proposing flexible solutions for imbalanced regression in evolving data streams. The proposed techniques - Histogram-based Undersampling (HistUS) and Histogram-based Oversampling (HistOS) - employ incremental online histograms to dynamically detect and prioritize rare instances across arbitrary regions of the target distribution, improving predictions for rare cases. Comprehensive experiments on synthetic and real-world benchmarks demonstrate that HistUS and HistOS substantially improve rare-case prediction accuracy, outperforming baseline models while maintaining competitiveness with Chebyshev-based approaches.
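
The general idea of histogram-guided undersampling on a stream can be illustrated with a minimal sketch. This is not the authors' HistUS implementation: the fixed-width bins, the class name `StreamingHistogram`, and the keep-probability rule (uniform bin share divided by observed bin frequency) are all assumptions chosen for illustration.

```python
from collections import defaultdict


class StreamingHistogram:
    """Fixed-width online histogram over target values (illustrative only)."""

    def __init__(self, bin_width=1.0):
        self.bin_width = bin_width
        self.counts = defaultdict(int)  # bin index -> number of targets seen
        self.total = 0

    def bin_index(self, y):
        # Floor division maps each target value to its fixed-width bin.
        return int(y // self.bin_width)

    def update(self, y):
        # Incremental update: one target value at a time, no stored history.
        self.counts[self.bin_index(y)] += 1
        self.total += 1


def keep_probability(hist, y):
    """Hypothetical rule: targets in crowded bins are kept less often.

    A bin holding at most a uniform share (1 / number of non-empty bins)
    of the stream is always kept; a more frequent bin is kept with
    probability uniform_share / observed_frequency.
    """
    count = hist.counts.get(hist.bin_index(y), 0)
    if hist.total == 0 or count == 0:
        return 1.0  # nothing seen in this region yet: treat as rare
    uniform_share = 1.0 / len(hist.counts)
    frequency = count / hist.total
    return min(1.0, uniform_share / frequency)


# Build a skewed stream: 90 common targets near 5, 10 rare targets near 50.
hist = StreamingHistogram(bin_width=10.0)
for target in [5.0] * 90 + [50.0] * 10:
    hist.update(target)
```

In a training loop, each incoming example would be passed to the learner only if a uniform random draw falls below `keep_probability(hist, y)`, so that examples from rare target regions are always retained while common ones are thinned.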

2025

Machine Learning and Knowledge Discovery in Databases. Research Track

Authors
Ribeiro, RP; Pfahringer, B; Japkowicz, N; Larrañaga, P; Jorge, AM; Soares, C; Abreu, PH; Gama, J;

Publication
Lecture Notes in Computer Science

Abstract

2025

Air Quality Data Analysis with Symbolic Principal Components

Authors
Loureiro, P; Oliveira, M; Brito, P; Oliveira, L;

Publication
Springer Proceedings in Mathematics and Statistics

Abstract
Air pollution is a global challenge with deep implications for public health and the environment. We examine air quality data from a monitoring station in Entrecampos, Lisbon, Portugal, using Symbolic Data Analysis. The dataset consists of hourly concentrations of nine pollutants over three years, which are logarithmically transformed and aggregated into intervals by taking the daily minimum and maximum values. The symbolic mean and variance are estimated for each variable through the method of moments, and the pairwise dependencies are captured using a bivariate copula. Symbolic principal component scores are obtained from the estimated covariance matrix and used to fit generalized extreme value distributions. Outlier maps, based on these distributions' quantiles, are used to identify outlying observations. A comparative analysis with daily average-based outlier detection methods is conducted. The results show the relevance of Symbolic Data Analysis in revealing new insights into air quality. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
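
The aggregation step described above (log transform, then daily minimum and maximum) can be sketched as follows. This is only an illustration of building interval-valued observations from hourly readings; the function name `daily_log_intervals`, the `(day, concentration)` input format, and the sample values are assumptions, not taken from the paper.

```python
import math
from collections import defaultdict


def daily_log_intervals(readings):
    """Aggregate hourly readings into one log-scale interval per day.

    readings: iterable of (day, concentration) pairs, concentration > 0.
    Returns a dict mapping each day to (min_log, max_log).
    """
    logs_by_day = defaultdict(list)
    for day, concentration in readings:
        logs_by_day[day].append(math.log(concentration))
    # Each day is summarised by the interval [daily minimum, daily maximum].
    return {day: (min(values), max(values)) for day, values in logs_by_day.items()}


# Made-up readings for a single pollutant, for illustration only.
sample = [
    ("2020-01-01", 1.0),
    ("2020-01-01", 4.0),
    ("2020-01-01", 2.0),
    ("2020-01-02", 10.0),
]
intervals = daily_log_intervals(sample)
```

Because the logarithm is monotone, taking the minimum and maximum after the transform gives the same interval as transforming the raw daily minimum and maximum.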

2025

Parametric models for distributional data

Authors
Brito, P; Silva, APD;

Publication
Advances in Data Analysis and Classification

Abstract
We present parametric probabilistic models for numerical distributional variables. The proposed models are based on the representation of each distribution by a location measure and inter-quantile ranges, for given quantiles, thereby characterizing the underlying empirical distributions in a flexible way. Multivariate Normal distributions are assumed for the whole set of indicators, considering alternative structures of the variance-covariance matrix. For all cases, maximum likelihood estimators of the corresponding parameters are derived. This modelling allows for hypothesis testing and multivariate parametric analysis. The proposed framework is applied to Analysis of Variance and parametric Discriminant Analysis of distributional data. A simulation study examines the performance of the proposed models in classification problems under different data conditions. Applications to Internet traffic data and Portuguese official data illustrate the relevance of the proposed approach.
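
As a rough illustration of representing an empirical distribution by a location measure plus inter-quantile ranges, the sketch below uses the median as the location and the differences between successive quartiles as the ranges. The choice of median, of quartiles, and the function name `location_and_ranges` are assumptions for illustration; the paper's exact indicators may differ.

```python
import statistics


def location_and_ranges(sample, n=4):
    """Summarise a sample by its median and successive inter-quantile ranges.

    n: number of equal-probability groups (n=4 gives quartile cut points).
    Returns (location, ranges) where ranges[i] = cut[i + 1] - cut[i].
    """
    cuts = statistics.quantiles(sample, n=n, method="inclusive")
    location = statistics.median(sample)
    ranges = [upper - lower for lower, upper in zip(cuts, cuts[1:])]
    return location, ranges


# One distributional observation, given here as a small made-up sample.
location, ranges = location_and_ranges([1, 2, 3, 4, 5])
```

Under a representation of this kind, each distributional observation becomes a short numeric vector (one location value plus non-negative ranges), which is the sort of indicator set a multivariate model can then be fitted to.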

2025

SPATA: Systematic Pattern Analysis for Detailed and Transparent Data Cards

Authors
Vitorino, J; Maia, E; Praça, I; Soares, C;

Publication
CoRR

Abstract
