2025
Authors
Mangussi, AD; Pereira, RC; Abreu, PH; Lorena, AC;
Publication
INTELLIGENT SYSTEMS, BRACIS 2024, PT I
Abstract
In real-world scenarios, a wide variety of datasets contain inconsistencies. One example of such inconsistency is missing data (MD), which refers to the absence of information in one or more variables. Missing value imputation strategies emerged as a possible solution to this problem, replacing the absent values with estimates based on the mean, the median, or Machine Learning (ML) techniques. The performance of such strategies depends on multiple factors. One factor that influences missing value imputation (MVI) methods is the presence of noisy instances, described as anything that obscures the relationship between the features of an instance and its class and has an adverse effect. However, the interaction between MD and noisy instances has received little attention in the literature. This work fills this gap by investigating the interplay between missing and noisy data. Our experimental setup begins with generating missingness under the Missing Not at Random (MNAR) mechanism in a multivariate scenario and performing imputation using seven state-of-the-art MVI methods. Our methodology involves applying a noise filter before performing the imputation task and evaluating the quality of the imputation directly. Additionally, we measure the classification performance with the new estimates. This approach is applied to synthetic data and to 11 real-world datasets, and the effects of noise filtering before imputation are evaluated. The results show that noise preprocessing before the imputation task improves both the imputation quality and the classification performance on the imputed datasets.
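A minimal sketch of the pipeline described in this abstract, assuming scikit-learn: a simple nearest-neighbour disagreement filter stands in for the noise filters studied in the paper, KNN imputation stands in for the seven MVI methods, and the MNAR amputation, dataset, and thresholds are illustrative choices rather than the paper's setup.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Simulate MNAR-style missingness: the largest values of one feature are hidden.
X_mis = X.copy()
cutoff = np.quantile(X_mis[:, 0], 0.8)
X_mis[X_mis[:, 0] > cutoff, 0] = np.nan

# Stand-in noise filter: among complete rows, drop instances whose label
# disagrees with the prediction of a 5-nearest-neighbour classifier.
complete = ~np.isnan(X_mis).any(axis=1)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_mis[complete], y[complete])
keep = np.ones(len(y), dtype=bool)
keep[complete] = knn.predict(X_mis[complete]) == y[complete]

# Impute after filtering, then measure downstream classification performance.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_mis[keep])
X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y[keep], random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("F1 after noise filtering + imputation:", f1_score(y_te, clf.predict(X_te)))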
2025
Authors
Mangussi, AD; Santos, MS; Lopes, FL; Pereira, RC; Lorena, AC; Abreu, PH;
Publication
NEUROCOMPUTING
Abstract
Missing data is characterized by the presence of absent values in data (i.e., missing values) and is currently categorized into three different mechanisms: Missing Completely at Random, Missing At Random, and Missing Not At Random. When performing missing data experiments and evaluating techniques to handle absent values, these mechanisms are often artificially generated (a process referred to as data amputation) to assess the robustness and behavior of the methods used. Due to the lack of a standard benchmark for data amputation, different implementations of the mechanisms are used in related research (and some are often not disclosed), preventing the reproducibility of results and leading to unfair or inaccurate comparisons between existing and new methods. Moreover, for users outside the field, experimenting with missing data or simulating the appearance of missing values in real-world domains is unfeasible, impairing stress testing of machine learning systems. This work introduces mdatagen, an open-source Python library for the generation of missing data mechanisms across 20 distinct scenarios, following different univariate and multivariate implementations of the established missing mechanisms. The package therefore fosters reproducible results across missing data experiments and enables the simulation of artificial missing data under flexible configurations, making it versatile enough to mimic several real-world applications involving missing data. The source code and detailed documentation for mdatagen are available at https://github.com/ArthurMangussi/pymdatagen.
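mdatagen's own API is documented at the repository above. As a library-agnostic illustration of what data amputation means, the following numpy sketch applies a self-masked MNAR mechanism to one feature; it is a conceptual example, not mdatagen's interface.

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))

# Self-masked MNAR: the largest values of a feature are the ones removed,
# so missingness depends on the (unobserved) value itself.
missing_rate = 0.2
col = 1
cutoff = np.quantile(X[:, col], 1 - missing_rate)

X_amputed = X.copy()
X_amputed[X[:, col] > cutoff, col] = np.nan

print("observed missing rate:", np.isnan(X_amputed[:, col]).mean())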
2025
Authors
Santos, JC; Tomás Pereira Alexandre, H; Seoane Santos, M; Henriques Abreu, P;
Publication
ACM Transactions on Computing for Healthcare
Abstract
2024
Authors
Mendes Neves, T; Seca, D; Sousa, R; Ribeiro, C; Mendes Moreira, J;
Publication
COMPUTATIONAL ECONOMICS
Abstract
As many automated algorithms find their way into the IT systems of the banking sector, having a way to validate and interpret the results of these algorithms can substantially reduce the risks associated with automation. Usually, validating these pricing mechanisms requires human resources to manually analyze and validate large quantities of data. There is a lack of effective methods that analyze a time series and assess whether what is currently happening is plausible given previous data, without information about the variables used to calculate the price of the asset. This paper describes an implementation of a process that allows us to validate many data points automatically. We explore the K-Nearest Neighbors algorithm to find coincident patterns in financial time series, allowing us to detect anomalies, outliers, and data points that do not follow normal behavior. This system allows quicker detection of defective calculations that would otherwise result in the incorrect pricing of financial assets. Furthermore, our method does not require knowledge of the variables used to calculate the time series being analyzed. Our proposal uses pattern matching and can validate more than 58% of instances, substantially improving human risk analysts' efficiency. The proposal is completely transparent, allowing analysts to understand how the algorithm made its decisions, which increases the trustworthiness of the method.
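A generic sketch of the idea, assuming scikit-learn: overlapping windows of a series are matched against historical windows with nearest-neighbour search, and windows with no close historical pattern are flagged. The synthetic series, window length, number of neighbours, and threshold are illustrative choices, not the paper's validation criteria.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic series with an injected defective segment standing in for a price series.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 40, 2000)) + 0.05 * rng.normal(size=2000)
series[1500:1510] += 2.0

# Slice the series into overlapping windows (the "patterns").
window = 20
windows = np.lib.stride_tricks.sliding_window_view(series, window)

# Fit on historical windows, score recent windows by distance to their
# nearest historical patterns; large distances indicate no coincident pattern.
history, recent = windows[:1000], windows[1000:]
nn = NearestNeighbors(n_neighbors=3).fit(history)
dist, _ = nn.kneighbors(recent)
score = dist.mean(axis=1)

threshold = np.quantile(score, 0.99)  # illustrative cut-off, not the paper's criterion
flagged = np.where(score > threshold)[0] + 1000
print("flagged window start indices:", flagged[:10])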
2024
Authors
Pinto, J; Esteves, V; Tavares, S; Sousa, R;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE
Abstract
The power transformer is one of the key components of any electrical grid and, as such, modern industrial activity requires constant use of the asset. This increases the likelihood of failures and can shorten a power transformer's lifespan. Dissolved gas analysis (DGA) is a technique developed to quantify the hydrocarbon gases dissolved in the power transformer oil, whose concentrations can in turn indicate the presence of faults. Since this process requires a different chemical analysis for each gas, the overall cost of the operation grows with the number of gases. A machine learning methodology was therefore defined to meet two simultaneous objectives: identify reduced gas subsets and predict the remaining gases, thus restoring them. Two subsets of equal or smaller size than those used by traditional methods (Duval's triangle, Rogers' ratios, IEC table) were identified, while showing potentially superior performance. The models restored the discarded gases, and the restored set was compared with the original set in a variety of validation tasks.
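An illustrative sketch of the restoration step, assuming scikit-learn and pandas: regressors predict the discarded gases from a measured subset. The gas names are the usual DGA species, but the chosen subset, the synthetic data, and the regressor are assumptions rather than the paper's configuration, so the printed score is not meaningful on random data.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for historical DGA measurements; real data would come
# from laboratory analyses of transformer oil samples.
gases = ["H2", "CH4", "C2H2", "C2H4", "C2H6"]
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.gamma(2.0, 10.0, size=(500, len(gases))), columns=gases)

measured = ["H2", "CH4", "C2H4"]                     # hypothetical cheaper-to-measure subset
restored = [g for g in gases if g not in measured]   # gases to be predicted instead of measured

# Multi-output regression: restore the discarded gases from the measured ones.
X_tr, X_te, y_tr, y_te = train_test_split(df[measured], df[restored], random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
print("R2 of restored gases:", r2_score(y_te, model.predict(X_te)))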
2024
Authors
Guimaraes, N; Campos, R; Jorge, A;
Publication
WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY
Abstract
Large language models (LLMs) have substantially pushed artificial intelligence (AI) research and applications in the last few years. They are currently able to achieve high effectiveness in different natural language processing (NLP) tasks, such as machine translation, named entity recognition, text classification, question answering, or text summarization. Recently, significant attention has been drawn to OpenAI's GPT models' capabilities and extremely accessible interface. LLMs are nowadays routinely used and studied for downstream tasks and specific applications with great success, pushing forward the state of the art in almost all of them. However, they also exhibit impressive inference capabilities when used off the shelf without further training. In this paper, we aim to study the behavior of pre-trained language models (PLMs) in some inference tasks they were not initially trained for. Therefore, we focus our attention on very recent research works related to the inference capabilities of PLMs in some selected tasks such as factual probing and common-sense reasoning. We highlight relevant achievements made by these models, as well as some of their current limitations that open opportunities for further research. This article is categorized under: Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining; Technologies > Artificial Intelligence.