2020
Authors
Santos, MS; Abreu, PH; Wilk, S; Santos, J;
Publication
PATTERN RECOGNITION LETTERS
Abstract
In missing data contexts, k-nearest neighbours imputation has proven beneficial since it takes advantage of the similarity between patterns to replace missing values. When dealing with heterogeneous data, researchers traditionally apply the HEOM distance, which handles continuous, nominal and missing data. Although other heterogeneous distances have been proposed, they have not yet been investigated and compared for k-nearest neighbours imputation. In this work, we study the effect of several heterogeneous distances on k-nearest neighbours imputation on a large benchmark of publicly available datasets.
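As a point of reference for the distance discussed above, the following is a minimal sketch of the HEOM (Heterogeneous Euclidean-Overlap Metric) computation between two patterns; the function name, the numeric encoding of nominal attributes, and the NaN convention for missing values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def heom(x, y, is_nominal, ranges):
    """Heterogeneous Euclidean-Overlap Metric between two patterns.

    x, y       : 1-D float arrays (np.nan marks a missing value; nominal
                 categories are assumed to be numerically encoded)
    is_nominal : boolean mask, True where a feature is nominal
    ranges     : per-feature range (max - min) of the continuous features
    """
    d = np.empty(len(x))
    for a in range(len(x)):
        if np.isnan(x[a]) or np.isnan(y[a]):
            d[a] = 1.0                              # missing values get the maximal distance
        elif is_nominal[a]:
            d[a] = 0.0 if x[a] == y[a] else 1.0     # overlap metric for nominal features
        else:
            d[a] = abs(x[a] - y[a]) / ranges[a]     # range-normalised difference
    return np.sqrt(np.sum(d ** 2))
```

Handling the three cases per attribute (missing, nominal, continuous) before aggregating them Euclidean-style is what lets a single distance operate on heterogeneous data.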
2020
Authors
Santos, MS; Abreu, PH; Wilk, S; Santos, JAM;
Publication
Artificial Intelligence in Medicine - 18th International Conference on Artificial Intelligence in Medicine, AIME 2020, Minneapolis, MN, USA, August 25-28, 2020, Proceedings
Abstract
In healthcare domains, dealing with missing data is crucial since absent observations compromise the reliability of decision support models. K-nearest neighbours imputation has proven beneficial since it takes advantage of the similarity between patients to replace missing values. Nevertheless, its performance largely depends on the distance function used to evaluate such similarity. In the literature, k-nearest neighbours imputation frequently neglects the nature of data or performs feature transformation, whereas in this work, we study the impact of different heterogeneous distance functions on k-nearest neighbours imputation for biomedical datasets. Our results show that distance functions considerably impact the performance of classifiers learned from the imputed data, especially when data is complex. © 2020, Springer Nature Switzerland AG.
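To make the role of the distance function concrete, here is a hedged sketch of k-nearest neighbours imputation parameterised by an arbitrary heterogeneous distance `dist(x, y)` (such as HEOM); the donor selection and the mean-based replacement are simplifications for illustration, not the setup evaluated in the paper.

```python
import numpy as np

def knn_impute(X, dist, k=5):
    """Replace each missing entry with the mean of the k nearest donors
    under the supplied heterogeneous distance function dist(x, y)."""
    X = np.array(X, dtype=float)
    for i in range(len(X)):
        missing = np.where(np.isnan(X[i]))[0]
        if missing.size == 0:
            continue
        # distance from pattern i to every other pattern (self excluded)
        d = np.array([dist(X[i], X[j]) if j != i else np.inf for j in range(len(X))])
        order = np.argsort(d)
        for a in missing:
            donors = [j for j in order if not np.isnan(X[j, a])][:k]
            if donors:
                X[i, a] = X[donors, a].mean()   # mode would be used for nominal features
    return X
```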
2018
Authors
Santos, MS; Soares, JP; Abreu, PH; Araujo, H; Santos, J;
Publication
IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE
Abstract
Although cross-validation is a standard procedure for performance evaluation, its joint application with oversampling remains an open question for researchers less familiar with the imbalanced data topic. A frequent experimental flaw is the application of oversampling algorithms to the entire dataset, resulting in biased models and overly optimistic estimates. We emphasize and distinguish overoptimism from overfitting, showing that the former is associated with the cross-validation procedure, while the latter is influenced by the chosen oversampling algorithm. Furthermore, we perform a thorough empirical comparison of well-established oversampling algorithms, supported by a data complexity analysis. The best oversampling techniques seem to possess three key characteristics: use of cleaning procedures, cluster-based example synthesis and adaptive weighting of minority examples, where Synthetic Minority Oversampling Technique coupled with Tomek Links and Majority Weighted Minority Oversampling Technique stand out, being capable of increasing the discriminative power of data.
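The methodological point above, that oversampling must be applied inside each training fold rather than to the whole dataset before splitting, can be illustrated with a short sketch; random duplication of minority examples stands in for SMOTE-style synthesis, and the classifier, metric, and binary-class assumption are illustrative choices, not the paper's protocol.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def random_oversample(X, y, rng):
    """Duplicate minority-class examples until both classes have equal size
    (a stand-in for SMOTE-style synthetic oversampling)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def cv_estimate(X, y, k=5, seed=0):
    """Oversample *inside* each training fold only, so test folds stay untouched
    and the estimate is not overly optimistic."""
    rng = np.random.default_rng(seed)
    scores = []
    for train, test in StratifiedKFold(k, shuffle=True, random_state=seed).split(X, y):
        X_tr, y_tr = random_oversample(X[train], y[train], rng)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(f1_score(y[test], model.predict(X[test])))
    return float(np.mean(scores))
```

Oversampling the full dataset first would leak (near-)duplicates of minority examples into the test folds, which is exactly the source of the overoptimism the paper distinguishes from overfitting.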
2018
Authors
Domingues, I; Abreu, PH; Santos, J;
Publication
2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)
Abstract
One of the main difficulties in the use of deep learning strategies in medical contexts is the training set size. While these methods need large annotated training sets, these datasets are costly to obtain in medical contexts and suffer from intra- and inter-subject variability. In the present work, two new pre-processing techniques are introduced to improve a deep classifier performance. First, data augmentation based on co-registration is suggested. Then, multi-scale enhancement based on Difference of Gaussians is proposed. Results are assessed in a public mammogram database, the InBreast, in the context of an ordinal problem, the BI-RADS classification. Moreover, a pre-trained Convolutional Neural Network with the AlexNet architecture was used as a base classifier. The multi-class classification experiments show that the proposed pipeline with the Difference of Gaussians and the data augmentation technique outperforms both using the original dataset only and using the original dataset augmented by mirroring the images.
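A minimal sketch of the Difference-of-Gaussians enhancement mentioned above is given below, assuming scipy is available; the sigma pairs are illustrative placeholders, since the abstract does not report the scales actually used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_enhance(image, sigmas=((1, 2), (2, 4), (4, 8))):
    """Multi-scale Difference-of-Gaussians enhancement: subtract a coarser
    Gaussian blur from a finer one at several scale pairs and stack the results."""
    image = image.astype(np.float32)
    bands = [gaussian_filter(image, s1) - gaussian_filter(image, s2)
             for s1, s2 in sigmas]
    return np.stack(bands, axis=-1)   # one enhanced channel per scale pair
```

Each band acts as a band-pass filter, emphasising structures at a particular scale, which is the general idea behind using such enhancement before a convolutional classifier.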
2020
Authors
Amorim, JP; Abreu, PH; Reyes, M; Santos, J;
Publication
2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)
Abstract
Saliency maps have been used as one possibility to interpret deep neural networks. This method estimates the relevance of each pixel in the image classification, with higher values representing pixels which contribute positively to classification. The goal of this study is to understand how the complexity of the network affects the interpretability of the saliency maps in classification tasks. To achieve that, we investigate how changes in regularization affect the saliency maps produced, and their fidelity to the overall classification process of the network. The experimental setup consists of calculating and comparing the fidelity of five saliency map methods, applying them to models trained on the CIFAR-10 dataset with different levels of weight decay on some or all of the layers. The results show that models with lower regularization are statistically more interpretable (at the 5% significance level) than the other models. Also, regularization applied only to the higher convolutional layers or to the fully-connected layers produces saliency maps with higher fidelity.
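For context, a vanilla gradient saliency map, one common instance of the family of methods studied, can be sketched as follows in PyTorch; the paper compares five such methods, which the abstract does not name, so this is only an illustrative example and not necessarily one of them.

```python
import torch

def gradient_saliency(model, image, target_class):
    """Vanilla gradient saliency: the relevance of each pixel is the absolute
    gradient of the target-class score with respect to that pixel."""
    model.eval()
    x = image.clone().unsqueeze(0).requires_grad_(True)   # add a batch dimension
    score = model(x)[0, target_class]                     # logit of the class of interest
    score.backward()
    # collapse the colour channels, keeping the strongest gradient per pixel
    return x.grad.abs().squeeze(0).max(dim=0).values
```

Weight decay of the kind varied in the study is typically set per parameter group in the optimizer, which is what allows it to be applied only to some layers.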
2018
Authors
Pompeu Soares, J; Seoane Santos, M; Henriques Abreu, P; Araújo, H; Santos, J;
Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract
In data imputation problems, researchers typically use several techniques, individually or in combination, in order to find the one that presents the best performance over all the features comprised in the dataset. This strategy, however, neglects the nature of data (data distribution) and makes the generalisation of the findings impractical, since for new datasets, a huge number of new, time-consuming experiments need to be performed. To overcome this issue, this work aims to understand the relationship between data distribution and the performance of standard imputation techniques, providing a heuristic on the choice of proper imputation methods and avoiding the need to test a large set of methods. To this end, several datasets were selected considering different sample sizes, numbers of features, distributions and contexts, and missing values were inserted at different percentages and under different scenarios. Then, different imputation methods were evaluated in terms of predictive and distributional accuracy. Our findings show that there is a relationship between features' distribution and algorithms' performance, and that their performance seems to be affected by the combination of missing rate and the missing data scenario at hand, and also by other less obvious factors such as sample size, goodness-of-fit of features and the ratio between the number of features and the different distributions comprised in the dataset. © Springer Nature Switzerland AG 2018.
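The two evaluation axes mentioned, predictive and distributional accuracy, can be sketched as follows with scikit-learn imputers; the MCAR-style masking, the RMSE and Kolmogorov-Smirnov measures, and the helper name are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.impute import SimpleImputer, KNNImputer

def evaluate_imputer(imputer, X_complete, missing_rate=0.1, seed=0):
    """Predictive accuracy (RMSE on masked entries) and distributional accuracy
    (mean KS statistic per feature) of one imputer on an artificially masked dataset."""
    X_complete = np.asarray(X_complete, dtype=float)
    rng = np.random.default_rng(seed)
    mask = rng.random(X_complete.shape) < missing_rate     # MCAR-style masking
    X_missing = np.where(mask, np.nan, X_complete)
    X_imputed = imputer.fit_transform(X_missing)           # assumes no feature becomes fully missing
    rmse = np.sqrt(np.mean((X_imputed[mask] - X_complete[mask]) ** 2))
    ks = np.mean([ks_2samp(X_imputed[:, j], X_complete[:, j]).statistic
                  for j in range(X_complete.shape[1])])
    return rmse, ks

# e.g. compare mean imputation against k-nearest neighbours imputation on a dataset X:
# for imp in (SimpleImputer(strategy="mean"), KNNImputer(n_neighbors=5)):
#     print(type(imp).__name__, evaluate_imputer(imp, X))
```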