Miriam Seoane Santos



Details

Name
Miriam Seoane Santos
Position
Senior Researcher
Since
1 January 2024

Nationality
Portugal
Centre
Laboratory of Artificial Intelligence and Decision Support (Laboratório de Inteligência Artificial e Apoio à Decisão)
Contact
+351222094057
miriam.s.santos@inesctec.pt

Publications


2025

Studying the robustness of data imputation methodologies against adversarial attacks

Authors
Mangussi, AD; Pereira, RC; Lorena, AC; Santos, MS; Abreu, PH

Publication
COMPUTERS & SECURITY

Abstract
Cybersecurity attacks, such as poisoning and evasion, can intentionally introduce false or misleading information in different forms into data, potentially leading to catastrophic consequences for critical infrastructures, like water supply or energy power plants. While numerous studies have investigated the impact of these attacks on model-based prediction approaches, they often overlook the impurities present in the data used to train these models. One such impurity is missing data, the absence of values in one or more features. This issue is typically addressed by imputing missing values with plausible estimates, which directly impacts the performance of the classifier. The goal of this work is to promote a Data-centric AI approach by investigating how different types of cybersecurity attacks impact the imputation process. To this end, we conducted experiments using four popular evasion and poisoning attack strategies across 29 real-world datasets, including the NSL-KDD and Edge-IIoT datasets, which were used as case studies. For the adversarial attack strategies, we employed the Fast Gradient Sign Method, Carlini & Wagner, Projected Gradient Descent, and Poison Attack against Support Vector Machine algorithms. Also, four state-of-the-art imputation strategies were tested under Missing Not At Random, Missing Completely at Random, and Missing At Random mechanisms using three missing rates (5%, 20%, 40%). We assessed imputation quality using MAE, while data distribution shifts were analyzed with the Kolmogorov-Smirnov and Chi-square tests. Furthermore, we measured classification performance by training an XGBoost classifier on the imputed datasets, using F1-score, Accuracy, and AUC. To deepen our analysis, we also incorporated six complexity metrics to characterize how adversarial attacks and imputation strategies impact dataset complexity. Our findings demonstrate that adversarial attacks significantly impact the imputation process. In terms of imputation quality error, the scenario that combines imputation with the Projected Gradient Descent attack proved to be more robust than those involving the other adversarial methods. Regarding data distribution error, results from the Kolmogorov-Smirnov test indicate that, for numerical features, all imputation strategies differ from the baseline (without missing data); for categorical features, however, the Chi-square test showed no difference between imputation and the baseline.
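As a rough illustration of the MAE-based imputation quality assessment mentioned in the abstract, the sketch below scores an imputation only on the entries that were missing. It is not the paper's pipeline: mean imputation stands in for the imputation strategies that were tested, and the function names and toy values are hypothetical.

```python
from statistics import mean


def mean_impute(column):
    """Replace None entries with the mean of the observed entries.

    Mean imputation is a stand-in here, not one of the paper's methods.
    """
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def imputation_mae(truth, with_missing, imputed):
    """Mean absolute error over the positions that were missing only."""
    errors = [
        abs(t - i)
        for t, m, i in zip(truth, with_missing, imputed)
        if m is None
    ]
    return sum(errors) / len(errors)


# Toy column: the ground truth is known because the gaps were introduced artificially.
truth = [1.0, 2.0, 3.0, 4.0, 5.0]
with_missing = [1.0, None, 3.0, None, 5.0]

imputed = mean_impute(with_missing)
mae = imputation_mae(truth, with_missing, imputed)
print(imputed, mae)
```

Restricting the error to the masked positions keeps untouched values, which trivially have zero error, from diluting the score.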


2025

QIDLEARNINGLIB: A Python library for quasi-identifier recognition and evaluation

Authors
Simoes, SA; Vilela, JP; Santos, MS; Abreu, PH

Publication
NEUROCOMPUTING

Abstract
Quasi-identifiers (QIDs) are attributes in a dataset that are not directly unique identifiers of the users/entities themselves but can be used, often in conjunction with other datasets or information, to identify individuals and thus present a privacy risk in data sharing and analysis. Identifying QIDs is important in developing proper strategies for anonymization and data sanitization. This paper proposes QIDLEARNINGLIB, a Python library that offers a set of metrics and tools to measure the qualities of QIDs and identify them in datasets. It incorporates metrics from different domains (causality, privacy, data utility, and performance) to offer a holistic assessment of the properties of attributes in a given tabular dataset. Furthermore, QIDLEARNINGLIB offers visual analysis tools to present how these metrics shift over a dataset and implements an extensible framework that employs multiple optimization algorithms, such as an evolutionary algorithm, simulated annealing, and greedy search, using these metrics to identify a meaningful set of QIDs.
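To make the notion of a quasi-identifier concrete, the sketch below measures what fraction of records becomes unique when a set of attributes is combined. This is a generic illustration and does not use QIDLEARNINGLIB or reproduce its metrics; the function name, attribute names, and records are all hypothetical.

```python
from collections import Counter


def uniqueness_ratio(records, attributes):
    """Fraction of records whose combination of `attributes` is unique."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical toy table: no single column identifies anyone directly.
records = [
    {"zip": "4200", "age": 34, "sex": "F"},
    {"zip": "4200", "age": 34, "sex": "M"},
    {"zip": "4200", "age": 51, "sex": "F"},
    {"zip": "4450", "age": 34, "sex": "F"},
]

for attrs in (["zip"], ["zip", "age"], ["zip", "age", "sex"]):
    print(attrs, uniqueness_ratio(records, attrs))
```

In this toy table the ratio rises from 0.25 to 0.5 to 1.0 as attributes are combined, which is why attributes that look harmless in isolation can jointly re-identify individuals.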


2025

A Label Propagation Approach for Missing Data Imputation

Authors
Lopes, FL; Mangussi, AD; Pereira, RC; Santos, MS; Abreu, PH; Lorena, AC

Publication
IEEE ACCESS

Abstract
Missing data is a common challenge in real-world datasets and can arise for various reasons. This has led to the classification of missing data mechanisms as missing completely at random, missing at random, or missing not at random. Currently, the literature offers various algorithms for imputing missing data, each with advantages tailored to specific mechanisms and levels of missingness. This paper introduces a novel approach to missing data imputation using the well-established label propagation algorithm, named Label Propagation for Missing Data Imputation (LPMD). The method combines, weighs, and propagates known feature values to impute missing data. Experiments on benchmark datasets highlight its effectiveness across various missing data scenarios, demonstrating more stable results compared to baseline methods under different missingness mechanisms and levels. The algorithms were evaluated based on processing time, imputation quality (measured by mean absolute error), and impact on classification performance. A variant of the algorithm (LPMD2) generally achieved the fastest processing time compared to five other imputation algorithms from the literature, with speed-ups ranging from 0.7 to 23 times. The results of LPMD were also stable regarding the mean absolute error of the imputed values compared to their original counterparts, for different missing data mechanisms and rates of missing values. In real applications, missingness can behave according to different and unknown mechanisms, so an imputation algorithm that behaves stably for different mechanisms is advantageous. The results regarding ML models produced using the imputed datasets were also comparable to the baselines.


2025

mdatagen: A python library for the artificial generation of missing data

Authors
Mangussi, AD; Santos, MS; Lopes, FL; Pereira, RC; Lorena, AC; Abreu, PH

Publication
NEUROCOMPUTING

Abstract
Missing data is characterized by the presence of absent values in data (i.e., missing values) and it is currently categorized into three different mechanisms: Missing Completely at Random, Missing At Random, and Missing Not At Random. When performing missing data experiments and evaluating techniques to handle absent values, these mechanisms are often artificially generated (a process referred to as data amputation) to assess the robustness and behavior of the methods used. Due to the lack of a standard benchmark for data amputation, different implementations of the mechanisms are used in related research (and some are often not disclosed), preventing the reproducibility of results and leading to an unfair or inaccurate comparison between existing and new methods. Moreover, for users outside the field, experimenting with missing data or simulating the appearance of missing values in real-world domains is unfeasible, impairing stress testing in machine learning systems. This work introduces mdatagen, an open source Python library for the generation of missing data mechanisms across 20 distinct scenarios, following different univariate and multivariate implementations of the established missing mechanisms. The package therefore fosters reproducible results across missing data experiments and enables the simulation of artificial missing data under flexible configurations, making it versatile enough to mimic several real-world applications involving missing data. The source code and detailed documentation for mdatagen are available at https://github.com/ArthurMangussi/pymdatagen.
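The three mechanisms named in the abstract can be illustrated with a minimal univariate amputation sketch. It does not use mdatagen or mirror any of its 20 scenarios; the function names, the deterministic "largest values first" rule, and the toy data are assumptions chosen to keep the example small and reproducible.

```python
import random


def _mask(values, missing_idx):
    """Return a copy of `values` with the selected positions set to None."""
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def ampute_mcar(values, rate, seed=0):
    """MCAR: positions are chosen at random, independent of any value."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    return _mask(values, set(rng.sample(range(len(values)), n_missing)))


def ampute_mar(values, driver, rate):
    """MAR: missingness depends on another, fully observed feature.

    Simplifying assumption: hide the entries with the largest `driver`.
    """
    n_missing = round(rate * len(values))
    order = sorted(range(len(values)), key=lambda i: driver[i], reverse=True)
    return _mask(values, set(order[:n_missing]))


def ampute_mnar(values, rate):
    """MNAR: missingness depends on the value that goes missing.

    Simplifying assumption: hide the largest values of the feature itself.
    """
    n_missing = round(rate * len(values))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return _mask(values, set(order[:n_missing]))


feature = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
other = [5, 1, 9, 2, 8, 3, 7, 4, 6, 0]  # hypothetical observed feature

mcar = ampute_mcar(feature, rate=0.2)
mar = ampute_mar(feature, other, rate=0.2)
mnar = ampute_mnar(feature, rate=0.2)
print(mcar, mar, mnar, sep="\n")
```

All three calls remove the same share of entries; only the rule that selects which entries disappear differs, and that rule is what distinguishes the mechanisms.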


2025

The Role of Deep Learning in Medical Image Inpainting: A Systematic Review

Authors
Santos, JC; Alexandre, HTP; Santos, MS; Abreu, PH

Publication
ACM TRANSACTIONS ON COMPUTING FOR HEALTHCARE

Abstract
Image inpainting is a crucial technique in computer vision, particularly for reconstructing corrupted images. In medical imaging, it addresses issues from instrumental errors, artifacts, or human factors. The development of deep learning techniques has revolutionized image inpainting, allowing for the generation of high-level semantic information to ensure structural and textural consistency in restored images. This article presents a comprehensive review of 53 studies on deep image inpainting in medical imaging, analyzing its evolution, impact, and limitations. The findings highlight the significance of deep image inpainting in artifact removal and enhancing the performance of multi-task approaches by localizing and inpainting regions of interest. Furthermore, the study identifies magnetic resonance imaging and computed tomography as the predominant modalities and highlights generative adversarial networks and U-Net as preferred architectures. Future research directions include the development of blind inpainting techniques, the exploration of techniques suitable for 3D/4D images, multiple artifacts, and multi-task applications, and the improvement of architectures.

