Publications - INESC TEC

Publications

Publications by Pedro Henriques Abreu

2023

Discovery Science - 26th International Conference, DS 2023, Porto, Portugal, October 9-11, 2023, Proceedings

Authors
Bifet, A; Lorena, AC; Ribeiro, RP; Gama, J; Abreu, PH;

Publication
DS

Abstract

2025

Pycol: A Python package for dataset complexity measures

Authors
Apóstolo, D; Santos, MS; Lorena, AC; Abreu, PH;

Publication
NEUROCOMPUTING

Abstract
Class overlap presents a significant challenge to machine learning algorithms, especially when class imbalance is present. These factors contribute substantially to the complexity of classification tasks, particularly in real-world scenarios. As a result, measuring overlap is crucial, yet it remains difficult to quantify because it can manifest and be measured in multiple ways. To help mitigate this, recent research has conceptualized a new taxonomy of class overlap measures, divided into multiple families, which gives researchers a more complete overview of dataset complexity. In line with this research, we introduce a new Python package for class overlap measurement named pycol. This package implements 29 overlap measures, divided into four overlap families specifically designed to capture class overlap in imbalanced real-world scenarios. This makes pycol an essential tool for researchers dealing with complex classification problems, providing robust solutions to quantify the joint effect of class overlap and class imbalance effectively.
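
As a rough illustration of what an instance-level overlap measure can look like, the sketch below computes, per class, the fraction of samples whose nearest neighbor carries a different label. It is plain Python and does not use pycol's API; the function name `nn_disagreement_by_class` and the toy data are assumptions made for this example, and the measure is a generic nearest-neighbor notion of overlap rather than any of the 29 measures as the package defines them.

```python
import math


def nn_disagreement_by_class(points, labels):
    """Per-class fraction of samples whose nearest other sample has a different label.

    Hypothetical helper for illustration only; not part of pycol.
    """
    if len(points) != len(labels) or len(points) < 2:
        raise ValueError("need at least two labeled samples")
    counts, misses = {}, {}
    for i, p in enumerate(points):
        # Nearest neighbor among all other samples (Euclidean distance).
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        counts[labels[i]] = counts.get(labels[i], 0) + 1
        if labels[nearest] != labels[i]:
            misses[labels[i]] = misses.get(labels[i], 0) + 1
    return {c: misses.get(c, 0) / counts[c] for c in counts}


# Toy data: class 0 forms a unit square; class 1 has two distant points
# and one point placed right next to a corner of the square.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (1.2, 1.2), (5, 5), (5, 6)]
y = [0, 0, 0, 0, 1, 1, 1]
rates = nn_disagreement_by_class(X, y)
print(rates)
```

Reporting the rate per class rather than as a single number keeps the smaller class visible, which matters when overlap and imbalance occur together.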


2019

Analyzing the Footprint of Classifiers in Adversarial Denial of Service Contexts

Authors
Martins, N; Cruz, JM; Cruz, T; Abreu, PH;

Publication
EPIA (2)

Abstract
Adversarial machine learning examines both the generation and detection of adversarial examples, which are inputs specially crafted to deceive classifiers. It has been researched extensively in image recognition, where modifications imperceptible to humans cause a classifier to make incorrect predictions. The main objective of this paper is to study the behavior of multiple state-of-the-art machine learning algorithms in an adversarial context. To perform this study, six different classification algorithms were used on two datasets, NSL-KDD and CICIDS2017, and four adversarial attack techniques were implemented with multiple perturbation magnitudes. Furthermore, the effectiveness of training the models with adversarial examples to improve recognition is also tested. The results show that adversarial attacks degrade the performance of all the classifiers by between 13% and 40%, with the Denoising Autoencoder being the technique with the highest resilience to attacks.


2026

LogicMix: Sample mixing data augmentation for multi-label image classification with partial labels

Authors
Chong, CF; Guo, JL; Yang, X; Ke, W; Abreu, PH; Wang, YP; Im, SK;

Publication
PATTERN RECOGNITION

Abstract
Multi-label image classification datasets are often partially labeled, with many labels missing, posing a significant challenge to training accurate deep classifiers. Most existing approaches treat the missing labels as negatives and/or exploit image and category relationships to regularize training. Orthogonally, this paper studies blending samples in such incomplete datasets into new samples, enlarging the training data to increase generalization. First, the proposed LogicMix mixes multiple partially labeled samples to produce new samples, where their unknown labels are naturally mixed by OR's logical equivalences, without replacement with constants. Subsequently, a Decouple Partial-Asymmetric Loss is proposed to assign separate label-focusing policies to original and new samples, addressing the learning imbalance from the different positive-negative label imbalances between original and augmented samples. Finally, we propose a complete learning framework called 2WayAug-PL. LogicMix and conventional data augmentation collaborate to extend the diversity of new samples in both the sample-sample relation and human prior knowledge, while pseudo-labeling compensates for the lack of labels to provide more supervision signals. Comprehensive experiments use 27 partially labeled dataset scenarios derived from three benchmarking datasets with various learning difficulties. LogicMix has shown remarkable effectiveness and generality in improving mAP against compared sample-mixing data augmentation methods. In particular, 2WayAug-PL achieves state-of-the-art average mAP of 84.3%, 50.1%, and 93.8% on MS-COCO, VG-200, and Pascal VOC 2007, respectively. It further pushes the previous best performance achieved by different frameworks by 0.6% (CFT), 0.6% (CFT), and 0.1% (SR). Moreover, 2WayAug-PL significantly outperforms all compared frameworks, as shown by statistical tests. Code is available at: https://github.com/maxium0526/logic_mix.
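
One plausible reading of mixing labels "by OR's logical equivalences" is a three-valued OR in which unknown stays unknown unless a known positive settles the result. The sketch below illustrates that reading only; it is not taken from the authors' repository, and the function name `or_mix_labels` and the use of `None` for an unknown label are assumptions made for this example.

```python
def or_mix_labels(*label_vectors):
    """Combine partial label vectors with a three-valued logical OR.

    Each label is 1 (positive), 0 (negative), or None (unknown).
    Hypothetical helper for illustration; not the LogicMix implementation.
    """
    if not label_vectors:
        raise ValueError("at least one label vector is required")
    if len({len(v) for v in label_vectors}) != 1:
        raise ValueError("all label vectors must have the same length")
    mixed = []
    for column in zip(*label_vectors):
        if any(v == 1 for v in column):
            mixed.append(1)      # one known positive decides the OR
        elif all(v == 0 for v in column):
            mixed.append(0)      # negative only if every source is a known negative
        else:
            mixed.append(None)   # otherwise the mixed label stays unknown
    return mixed


a = [1, 0, None, 0]
b = [0, 0, 1, None]
mixed = or_mix_labels(a, b)
print(mixed)  # [1, 0, 1, None]
```

Under this reading, no unknown label is replaced with a constant: the third category becomes positive because one source is known positive, while the fourth stays unknown because a known negative alone cannot rule out the category in the other sample.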


2025

Category-wise Fine-Tuning: Resisting incorrect pseudo-labels in multi-label image classification with partial labels

Authors
Chong, CF; Fang, XY; Guo, JL; Abreu, PH; Wang, YP; Yang, X; Kea, W; Im, SK;

Publication
NEUROCOMPUTING

Abstract
Large-scale image datasets are often partially labeled, where only a few categories' labels are known for each image. Assigning pseudo-labels to unknown labels to gain additional training signals has become prevalent for training deep classification models. However, some pseudo-labels are inevitably incorrect, leading to a notable decline in the model classification performance. In this paper, we propose a new method called Category-wise Fine-Tuning (CFT), aiming to reduce model inaccuracies caused by the wrong pseudo-labels. In particular, CFT employs known labels without pseudo-labels to fine-tune the logistic regressions of trained models individually to calibrate each category's model predictions. A Genetic Algorithm, seldom used for training deep models, is also used in CFT to maximize the classification performance directly. CFT is applied to well-trained models, unlike most existing methods that train models from scratch. Hence, CFT is general and compatible with models trained with different methods and schemes, as demonstrated through extensive experiments. CFT requires only a few seconds per category for calibration on consumer-grade GPUs. We achieve state-of-the-art results on three benchmarking datasets, including the CheXpert chest X-ray competition dataset (ensemble mAUC 93.33%, single model 91.82%), partially labeled MS-COCO (average mAP 83.69%), and Open Image V3 (mAP 85.31%), outperforming the previous bests by 0.28%, 2.21%, 2.50%, and 0.91%, respectively. The single model on CheXpert has been officially evaluated by the competition server, endorsing the correctness of the result. The outstanding results and generalizability indicate that CFT could be substantial and prevalent for classification model development. Code is available at: https://github.com/maxium0526/category-wise-fine-tuning.


2026

mlcpl: A python package for deep multi-label image classification with partial-labels on PyTorch

Authors
Chong, CF; Yang, X; Wang, YP; Abreu, PH;

Publication
NEUROCOMPUTING

Abstract
Multi-label image classification models often have to learn from partially labeled datasets, where a considerable proportion of labels are missing. However, the popular PyTorch deep learning ecosystem is less compatible with training on partially labeled datasets, as many built-in functions like loss functions and metrics do not work correctly or raise errors when unknown labels are present. To this end, we present an original and easy-to-install Python package called mlcpl, which expands the PyTorch ecosystem to offer a friendly environment for learning with partially labeled datasets. The package provides a series of multi-label loss functions and metrics that are compatible with unknown labels. Seven recently proposed approaches are also implemented for the convenient use of cutting-edge techniques. In addition, eleven dataset loading functions, followed by three partial label simulation schemes, expedite the development of experiments. Furthermore, these functions are simple to use, have a PyTorch-like interface, and can collaborate well with other PyTorch components. Several examples of experiments with mlcpl are also provided for demonstration. We hope the release of this package will facilitate relevant academic research and real-world applications. The source code is available at https://github.com/maxium0526/mlcpl.
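
To make concrete what a loss "compatible with unknown labels" might do, the sketch below averages binary cross-entropy over known labels only and skips unknown ones. It is written in plain Python for readability and does not use mlcpl or PyTorch; the function name `masked_bce`, the `None` encoding for unknown labels, and the choice to average over known labels are all assumptions made for this example, not the package's interface.

```python
import math


def masked_bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over known labels only.

    probs:   predicted probabilities in [0, 1], one per category
    targets: 1, 0, or None (unknown) per category
    Hypothetical helper for illustration; not part of mlcpl.
    """
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")
    total, known = 0.0, 0
    for p, t in zip(probs, targets):
        if t is None:
            continue  # unknown label: contributes nothing to the loss
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -math.log(p) if t == 1 else -math.log(1.0 - p)
        known += 1
    return total / known if known else 0.0


loss = masked_bce([0.9, 0.2, 0.7], [1, 0, None])
same = masked_bce([0.9, 0.2, 0.1], [1, 0, None])
print(loss, same)
```

Because the third label is unknown, changing its predicted probability leaves the loss unchanged, so the two printed values are equal.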
