Publications

Publications by Miriam Seoane Santos

2019

Generating Synthetic Missing Data: A Review by Missing Mechanism

Authors
Santos, MS; Pereira, RC; Costa, AF; Soares, JP; Santos, J; Abreu, PH;

Publication
IEEE ACCESS

Abstract
The performance evaluation of imputation algorithms often involves the generation of missing values. Missing values can be inserted in only one feature (univariate configuration) or in several features (multivariate configuration) at different percentages (missing rates) and according to distinct missing mechanisms, namely, missing completely at random, missing at random, and missing not at random. Since the missing data generation process defines the basis for the imputation experiments (configuration, missing rate, and missing mechanism), it is essential that it is appropriately applied; otherwise, conclusions derived from ill-defined setups may be invalid. The goal of this paper is to review the different approaches to synthetic missing data generation found in the literature and discuss their practical details, elaborating on their strengths and weaknesses. Our analysis revealed that creating missing at random and missing not at random scenarios in datasets comprising qualitative features is the most challenging issue in the related work and, therefore, should be the focus of future work in the field.

CloseRead Abstract

2016

Predicting Breast Cancer Recurrence Using Machine Learning Techniques: A Systematic Review

Authors
Abreu, PH; Santos, MS; Abreu, MH; Andrade, B; Silva, DC;

Publication
ACM COMPUTING SURVEYS

Abstract
Background: Recurrence is an important cornerstone in breast cancer behavior, intrinsically related to mortality. In spite of its relevance, it is rarely recorded in the majority of breast cancer datasets, which makes research in its prediction more difficult. Objectives: To evaluate the performance of machine learning techniques applied to the prediction of breast cancer recurrence. Material and Methods: Revision of published works that used machine learning techniques in local and open source databases between 1997 and 2014. Results: The revision showed that it is difficult to obtain a representative dataset for breast cancer recurrence and there is no consensus on the best set of predictors for this disease. High accuracy results are often achieved, yet compromising sensitivity. The missing data and class imbalance problems are rarely addressed and most often the chosen performance metrics are inappropriate for the context. Discussion and Conclusions: Although different techniques have been used, prediction of breast cancer recurrence is still an open problem. The combination of different machine learning techniques, along with the definition of standard predictors for breast cancer recurrence seem to be the main future directions to obtain better results.

CloseRead Abstract

2018

Improving the Classifier Performance in Motor Imagery Task Classification: What are the steps in the classification process that we should worry about?

Authors
Santos, MS; Abreu, PH; Rodríguez Bermúdez, G; García Laencina, PJ;

Publication
INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS

Abstract
Brain-Computer Interface systems based on motor imagery are able to identify an individual's intent to initiate control through the classification of encephalography patterns. Correctly classifying such patterns is instrumental and strongly depends in a robust machine learning block that is able to properly process the features extracted from a subject's encephalograms. The main objective of this work is to provide an overall view on machine learning stages, aiming to answer the following question: "What are the steps in the classification process that we should worry about?". The obtained results suggest that future research in the field should focus on two main aspects: exploring techniques for dimensionality reduction, in particular, supervised linear approaches, and evaluating adequate validation schemes to allow a more precise interpretation of results.

CloseRead Abstract

2022

The identification of cancer lesions in mammography images with missing pixels: analysis of morphology

Authors
Santos, JC; Abreu, PH; Santos, MS;

Publication
2022 IEEE 9TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA)

Abstract
The quality of mammography images is essential for the diagnosis of breast cancer and image imputation has become a popular technique to overcome noise, artifacts, and missing data to aid in the diagnosis of diseases. In this paper, we assess the performance of six imputation methodologies for the reconstruction of missing pixels in different morphologies in mammography images. The images included in this study are collected from four public datasets (CBIS-DDSM, Mini-MIAS, INbreast, and CSAW) and the imputation results are evaluated through the mean absolute error (MAE) and structural similarity index measure (SSIM). This study goes beyond the traditional evaluation of imputation algorithms, analyzing imputation quality, morphology preservation and classification performance. The effects of imputation on the morphology of cancer lesions are of utmost importance since it lays the foundation for physicians to interpret and analyze the imputation results. The results show that DIP is the most promising methodology for higher missing pixel rates, morphology preservation, and classifying malignant and benign images.

CloseRead Abstract

2021

FAWOS: Fairness-Aware Oversampling Algorithm Based on Distributions of Sensitive Attributes

Authors
Salazar, T; Santos, MS; Araújo, H; Abreu, PH;

Publication
IEEE ACCESS

Abstract
With the increased use of machine learning algorithms to make decisions which impact people's lives, it is of extreme importance to ensure that predictions do not prejudice subgroups of the population with respect to sensitive attributes such as race or gender. Discrimination occurs when the probability of a positive outcome changes across privileged and unprivileged groups defined by the sensitive attributes. It has been shown that this bias can be originated from imbalanced data contexts where one of the classes contains a much smaller number of instances than the other classes. It is also important to identify the nature of the imbalanced data, including the characteristics of the minority classes' distribution. This paper presents FAWOS: a Fairness-Aware oversampling algorithm which aims to attenuate unfair treatment by handling sensitive attributes' imbalance. We categorize different types of datapoints according to their local neighbourhood with respect to the sensitive attributes, identifying which are more difficult to learn by the classifiers. In order to balance the dataset, FAWOS oversamples the training data by creating new synthetic datapoints using the different types of datapoints identified. We test the impact of FAWOS on different learning classifiers and analyze which can better handle sensitive attribute imbalance. Empirically, we observe that this algorithm can effectively increase the fairness results of the classifiers while not neglecting the classification performance. Source code can be found at: https://github.com/teresalazar13/FAWOS

CloseRead Abstract

2018

Missing Data Imputation via Denoising Autoencoders: The Untold Story

Authors
Costa, AF; Santos, MS; Soares, JP; Abreu, PH;

Publication
IDA

Abstract
Missing data consists in the lack of information in a dataset and since it directly influences classification performance, neglecting it is not a valid option. Over the years, several studies presented alternative imputation strategies to deal with the three missing data mechanisms, Missing Completely At Random, Missing At Random and Missing Not At Random. However, there are no studies regarding the influence of all these three mechanisms on the latest high-performance Artificial Intelligence techniques, such as Deep Learning. The goal of this work is to perform a comparison study between state-of-the-art imputation techniques and a Stacked Denoising Autoencoders approach. To that end, the missing data mechanisms were synthetically generated in 6 different ways; 8 different imputation techniques were implemented; and finally, 33 complete datasets from different open source repositories were selected. The obtained results showed that Support Vector Machines imputation ensures the best classification performance while Multiple Imputation by Chained Equations performs better in terms of imputation quality.

CloseRead Abstract