2020
Authors
Amorim, JP; Abreu, PH; Reyes, M; Santos, J;
Publication
2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)
Abstract
Saliency maps have been used as one way to interpret deep neural networks. This method estimates the relevance of each pixel to the image classification, with higher values representing pixels that contribute positively to the classification. The goal of this study is to understand how the complexity of the network affects the interpretability of the saliency maps in classification tasks. To achieve that, we investigate how changes in regularization affect the saliency maps produced and their fidelity to the overall classification process of the network. The experimental setup consists of computing the fidelity of five saliency map methods, which were compared by applying them to models trained on the CIFAR-10 dataset using different levels of weight decay on some or all of the layers. The results show that models with lower regularization are statistically more interpretable (at a 5% significance level) than the other models. Also, regularization applied only to the higher convolutional layers or to the fully-connected layers produces saliency maps with higher fidelity.
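To make the idea concrete, below is a minimal sketch of a gradient-based saliency map, one common family among the methods this kind of study compares (the paper itself evaluates five specific methods not reproduced here). The toy model and random input are hypothetical placeholders; any trained CIFAR-10 classifier would be used in practice.

```python
import torch
import torch.nn as nn

# Toy CIFAR-10-sized classifier; a trained model would be used in practice.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 32 * 32, 10),
)
model.eval()

image = torch.rand(1, 3, 32, 32, requires_grad=True)  # placeholder input
logits = model(image)
score = logits[0, logits.argmax()]   # score of the predicted class
score.backward()                     # gradient of the score w.r.t. pixels

# Pixel relevance: absolute gradient, maximum over colour channels.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)
```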
2018
Authors
Pompeu Soares, J; Seoane Santos, M; Henriques Abreu, P; Araújo, H; Santos, J;
Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract
In data imputation problems, researchers typically use several techniques, individually or in combination, in order to find the one that presents the best performance over all the features comprised in the dataset. This strategy, however, neglects the nature of the data (its distribution) and makes the generalisation of the findings impractical, since for new datasets a huge number of new, time-consuming experiments need to be performed. To overcome this issue, this work aims to understand the relationship between data distribution and the performance of standard imputation techniques, providing a heuristic for the choice of proper imputation methods and avoiding the need to test a large set of methods. To this end, several datasets were selected considering different sample sizes, numbers of features, distributions and contexts, and missing values were inserted at different percentages and scenarios. Then, different imputation methods were evaluated in terms of predictive and distributional accuracy. Our findings show that there is a relationship between the features' distribution and the algorithms' performance, and that their performance seems to be affected by the combination of missing rate and scenario at hand, as well as by other less obvious factors such as sample size, the goodness-of-fit of features and the ratio between the number of features and the different distributions comprised in the dataset.
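A minimal sketch of the kind of evaluation described above, assuming a toy continuous dataset and simple mean imputation (the paper evaluates several standard methods): missing values are inserted completely at random, imputed, and the reconstruction error over the imputed cells is measured as the predictive accuracy.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # toy continuous dataset (assumption)

mask = rng.random(X.shape) < 0.20    # ~20% missing rate, MCAR
X_miss = X.copy()
X_miss[mask] = np.nan

X_imp = SimpleImputer(strategy="mean").fit_transform(X_miss)

# Predictive accuracy as RMSE over the imputed cells only.
rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))
print(f"mean-imputation RMSE: {rmse:.3f}")
```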
2022
Authors
Santos, MS; Abreu, PH; Fernandez, A; Luengo, J; Santos, J;
Publication
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE
Abstract
This work performs an in-depth study of the impact of distance functions on K-Nearest Neighbours imputation of heterogeneous datasets. Missing data is generated at several percentages, on a large benchmark of 150 datasets (50 continuous, 50 categorical and 50 heterogeneous datasets) and data imputation is performed using different distance functions (HEOM, HEOM-R, HVDM, HVDM-R, HVDM-S, MDE and SIMDIST) and k values (1, 3, 5 and 7). The impact of distance functions on kNN imputation is then evaluated in terms of classification performance, through the analysis of a classifier learned from the imputed data, and in terms of imputation quality, where the quality of the reconstruction of the original values is assessed. By analysing the properties of heterogeneous distance functions over continuous and categorical datasets individually, we then study their behaviour over heterogeneous data. We discuss whether datasets with different natures may benefit from different distance functions and to what extent the component of a distance function that deals with missing values influences such choice. Our experiments show that missing data has a significant impact on distance computation and the obtained results provide guidelines on how to choose appropriate distance functions depending on data characteristics (continuous, categorical or heterogeneous datasets) and the objective of the study (classification or imputation tasks).
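As an illustration of the heterogeneous distance functions studied, here is a sketch of HEOM (Heterogeneous Euclidean-Overlap Metric): overlap distance for categorical attributes, range-normalised absolute difference for numeric ones, and maximal distance (1) whenever a value is missing. The function signature is an illustrative choice, not the paper's implementation.

```python
import numpy as np

def heom(x, y, categorical, ranges):
    """x, y: feature vectors (np.nan marks missing numeric values, None
    missing categorical ones); categorical: per-feature boolean mask;
    ranges: per-feature max-min for numeric features."""
    d = np.empty(len(x))
    for a in range(len(x)):
        if x[a] is None or y[a] is None or (
                not categorical[a] and (np.isnan(x[a]) or np.isnan(y[a]))):
            d[a] = 1.0                           # missing -> maximal distance
        elif categorical[a]:
            d[a] = 0.0 if x[a] == y[a] else 1.0  # overlap metric
        else:
            d[a] = abs(x[a] - y[a]) / ranges[a]  # normalised difference
    return np.sqrt(np.sum(d ** 2))

# e.g. one numeric and one categorical feature:
d = heom([1.0, "red"], [3.0, "blue"], categorical=[False, True],
         ranges=[10.0, None])  # sqrt(0.2**2 + 1**2) ≈ 1.02
```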
2017
Authors
Nogueira, MA; Abreu, PH; Martins, P; Machado, P; Duarte, H; Santos, J;
Publication
ARTIFICIAL INTELLIGENCE REVIEW
Abstract
Clinical decisions are sometimes based on a variety of patient information, such as age, weight, or information extracted from image exams, among others. Depending on the nature of the disease or anatomy, clinicians can base their decisions on different image exams such as mammograms, positron emission tomography scans or magnetic resonance images. However, the analysis of those exams is far from a trivial task. Over the years, the use of image descriptors (computational algorithms that present a summarized description of image regions) became an important tool to assist the clinician in such tasks. This paper presents an overview of the use of image descriptors in healthcare contexts, covering different image exams. In preparing this review, we analyzed over 70 studies related to the application of image descriptors of different natures (e.g., intensity, texture, shape) in medical image analysis. Four imaging modalities are featured: mammography, PET, CT and MRI. Pathologies typically covered by these modalities are addressed: breast masses and microcalcifications in mammograms, head and neck cancer and Alzheimer's disease in the case of PET images, lung nodules regarding CTs, and multiple sclerosis and brain tumors in the MRI section.
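For a sense of what an intensity descriptor looks like in practice, here is a toy example in the spirit of those reviewed: it summarises an image region by first-order statistics of its grey levels. Region extraction and clinical use are, of course, far more involved; the chosen statistics and bin count are illustrative assumptions.

```python
import numpy as np

def intensity_descriptor(region):
    """region: 2-D array of grey-level values for one image region."""
    r = region.astype(float).ravel()
    hist, _ = np.histogram(r, bins=32)
    p = hist[hist > 0] / r.size              # bin probabilities
    return {
        "mean": r.mean(),                    # average intensity
        "std": r.std(),                      # contrast
        "entropy": -np.sum(p * np.log2(p)),  # heterogeneity of grey levels
    }
```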
2019
Authors
Santos, MS; Pereira, RC; Costa, AF; Soares, JP; Santos, J; Abreu, PH;
Publication
IEEE ACCESS
Abstract
The performance evaluation of imputation algorithms often involves the generation of missing values. Missing values can be inserted in only one feature (univariate configuration) or in several features (multivariate configuration) at different percentages (missing rates) and according to distinct missing mechanisms, namely, missing completely at random, missing at random, and missing not at random. Since the missing data generation process defines the basis for the imputation experiments (configuration, missing rate, and missing mechanism), it is essential that it is appropriately applied; otherwise, conclusions derived from ill-defined setups may be invalid. The goal of this paper is to review the different approaches to synthetic missing data generation found in the literature and discuss their practical details, elaborating on their strengths and weaknesses. Our analysis revealed that creating missing at random and missing not at random scenarios in datasets comprising qualitative features is the most challenging issue in the related work and, therefore, should be the focus of future work in the field.
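A minimal sketch of the three mechanisms for a single feature (univariate configuration); the MAR and MNAR variants found in the literature differ in detail, and the deterministic "remove the largest values" choice below is just one common option.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))   # toy dataset (assumption)
rate = 0.10                      # 10% missing rate
n = int(rate * len(X))

# MCAR: missing positions in feature 0 chosen uniformly at random.
mcar_rows = rng.choice(len(X), size=n, replace=False)

# MAR: missingness in feature 0 depends on another, observed feature.
mar_rows = np.argsort(X[:, 1])[-n:]

# MNAR: missingness depends on the (unobserved) value itself --
# here, the n largest values of feature 0 are removed.
mnar_rows = np.argsort(X[:, 0])[-n:]

X_mcar, X_mar, X_mnar = X.copy(), X.copy(), X.copy()
X_mcar[mcar_rows, 0] = np.nan
X_mar[mar_rows, 0] = np.nan
X_mnar[mnar_rows, 0] = np.nan
```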
2018
Authors
Domingues, I; Amorim, JP; Abreu, PH; Duarte, H; Santos, JAM;
Publication
2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018
Abstract
Data imbalance is characterized by a discrepancy in the number of examples per class in a dataset. This phenomenon is known to deteriorate the performance of classifiers, since they are less able to learn the characteristics of the less represented classes. For most imbalanced datasets, the application of sampling techniques improves the classifier's performance. For small datasets, oversampling has been shown to be the most appropriate strategy, since it augments the original set of samples. Although several oversampling strategies have been proposed and tested over the years, the work has mostly focused on binary or multi-class tasks. Motivated by medical applications, where there is often an order associated with the classes (increasing likelihood of malignancy, for instance), the present work tests some existing oversampling techniques in ordinal contexts. Moreover, four new oversampling techniques are proposed. Experiments were performed on both private and public datasets. The private datasets concern the assessment of response to treatment in oncologic diseases. The 15 public datasets were chosen since they are widely used in the literature. Results show that data balancing techniques improve classification results on ordinal imbalanced datasets, even when these techniques are not specifically designed for ordinal problems. With our pipeline, results better than or equal to the published ones were obtained for 10 of the 15 public datasets, with improvements of up to a 0.43 decrease in MMAE.
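For context, a sketch of plain random oversampling, the baseline idea behind the techniques tested (not one of the four new methods proposed in the paper): minority classes are resampled with replacement until every class matches the majority count. It applies to ordinal labels as well, although it ignores the class order.

```python
import numpy as np

def random_oversample(X, y, rng=None):
    """Resample minority classes with replacement to balance class counts."""
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    Xs, ys = [X], [y]
    for c, cnt in zip(classes, counts):
        if cnt < target:
            idx = rng.choice(np.flatnonzero(y == c),
                             size=target - cnt, replace=True)
            Xs.append(X[idx])
            ys.append(y[idx])
    return np.concatenate(Xs), np.concatenate(ys)
```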