Publications

Publications by Paulo Jorge Azevedo

2015

Contrast set mining in temporal databases

Authors
Magalhaes, A; Azevedo, PJ;

Publication
EXPERT SYSTEMS

Abstract
Understanding the underlying differences between groups or classes in certain contexts can be of the utmost importance. Contrast set mining relies on discovering significant patterns by contrasting two or more groups. A contrast set is a conjunction of attribute-value pairs that differ meaningfully in its distribution across groups. A previously proposed technique is rules for contrast sets, which seeks to express each contrast set found in terms of rules. This work extends rules for contrast sets to a temporal data mining task. We define a set of temporal patterns in order to capture the significant changes in the contrasts discovered along the considered time line. To evaluate the proposal accuracy and ability to discover relevant information, two different real-life data sets were studied using this approach.

CloseRead Abstract

2020

Sequence Mining for Automatic Generation of Software Tests from GUI Event Traces

Authors
Oliveira, A; Freitas, R; Jorge, A; Amorim, V; Moniz, N; Paiva, ACR; Azevedo, PJ;

Publication
Intelligent Data Engineering and Automated Learning - IDEAL 2020 - 21st International Conference, Guimaraes, Portugal, November 4-6, 2020, Proceedings, Part II

Abstract
In today’s software industry, systems are constantly changing. To maintain their quality and to prevent failures at controlled costs is a challenge. One way to foster quality is through thorough and systematic testing. Therefore, the definition of adequate tests is crucial for saving time, cost and effort. This paper presents a framework that generates software test cases automatically based on user interaction data. We propose a data-driven software test generation solution that combines the use of frequent sequence mining and Markov chain modeling. We assess the quality of the generated test cases by empirically evaluating their coverage with respect to observed user interactions and code. We also measure the plausibility of the distribution of the events in the generated test sets using the Kullback-Leibler divergence. © 2020, Springer Nature Switzerland AG.

CloseRead Abstract

2023

Subgroup mining for performance analysis of regression models

Authors
Pimentel, J; Azevedo, PJ; Torgo, L;

Publication
EXPERT SYSTEMS

Abstract
Machine learning algorithms have shown several advantages compared to humans, namely in terms of the scale of data that can be analysed, delivering high speed and precision. However, it is not always possible to understand how algorithms work. As a result of the complexity of some algorithms, users started to feel the need to ask for explanations, boosting the relevance of Explainable Artificial Intelligence. This field aims to explain and interpret models with the use of specific analytical methods that usually analyse how their predicted values and/or errors behave. While prediction analysis is widely studied, performance analysis has limitations for regression models. This paper proposes a rule-based approach, Error Distribution Rules (EDRs), to uncover atypical error regions, while considering multivariate feature interactions without size restrictions. Extracting EDRs is a form of subgroup mining. EDRs are model agnostic and a drill-down technique to evaluate regression models, which consider multivariate interactions between predictors. EDRs uncover regions of the input space with deviating performance providing an interpretable description of these regions. They can be regarded as a complementary tool to the standard reporting of the expected average predictive performance. Moreover, by providing interpretable descriptions of these specific regions, EDRs allow end users to understand the dangers of using regression tools for some specific cases that fall on these regions, that is, they improve the accountability of models. The performance of several models from different problems was studied, showing that our proposal allows the analysis of many situations and direct model comparison. In order to facilitate the examination of rules, two visualization tools based on boxplots and density plots were implemented. A network visualization tool is also provided to rapidly check interactions of every feature condition. An additional tool is provided by using a grid of boxplots, where comparison between quartiles of every distribution with a reference is performed. Based on this comparison, an extrapolation of counterfactual examples to regression was also implemented. A set of examples is described, including a setting where regression models performance is compared in detail using EDRs. Specifically, the error difference between two models in a dataset is studied by deriving rules highlighting regions of the input space where model performance difference is unexpected. The application of visual tools is illustrated using EDRs examples derived from public available datasets. Also, case studies illustrating the specialization of subgroups, identification of counter factual subgroups and detecting unanticipated complex models are presented. This paper extends the state of the art by providing a method to derive explanations for model performance instead of explanations for model predictions.

CloseRead Abstract

2009

Spatial Clustering of Molecular Dynamics Trajectories in Protein Unfolding Simulations

Authors
Ferreira, PG; Silva, CG; Azevedo, PJ; Brito, RMM;

Publication
COMPUTATIONAL INTELLIGENCE METHODS FOR BIOINFORMATICS AND BIOSTATISTICS

Abstract
Molecular dynamics simulations is a valuable tool to study protein unfolding in silico. Analyzing the relative spatial position of the residues during the simulation may indicate which residues are essential in determining the protein structure. We present a method, inspired by a popular data mining technique called Frequent Itemset Mining, that clusters sets of amino acid residues with a synchronized trajectory during the unfolding process. The proposed approach has several advantages over traditional hierarchical clustering. © 2009 Springer Berlin Heidelberg.

CloseRead Abstract

2010

Rules for contrast sets

Authors
Azevedo, PJ;

Publication
INTELLIGENT DATA ANALYSIS

Abstract
In this paper we present a technique to derive rules describing contrast sets. Contrast sets are a formalism to represent groups differences. We propose a novel approach to describe directional contrasts using rules where the contrasting effect is partitioned into pairs of groups. Our approach makes use of a directional Fisher Exact Test to find significant differences across groups. We used a Bonferroni within-search adjustment to control type I errors and a pruning technique to prevent derivation of non significant contrast set specializations.

CloseRead Abstract

2007

Evaluating deterministic motif significance measures in protein databases

Authors
Ferreira, PG; Azevedo, PJ;

Publication
ALGORITHMS FOR MOLECULAR BIOLOGY

Abstract
Background: Assessing the outcome of motif mining algorithms is an essential task, as the number of reported motifs can be very large. Significance measures play a central role in automatically ranking those motifs, and therefore alleviating the analysis work. Spotting the most interesting and relevant motifs is then dependent on the choice of the right measures. The combined use of several measures may provide more robust results. However caution has to be taken in order to avoid spurious evaluations. Results: From the set of conducted experiments, it was verified that several of the selected significance measures show a very similar behavior in a wide range of situations therefore providing redundant information. Some measures have proved to be more appropriate to rank highly conserved motifs, while others are more appropriate for weakly conserved ones. Support appears as a very important feature to be considered for correct motif ranking. We observed that not all the measures are suitable for situations with poorly balanced class information, like for instance, when positive data is significantly less than negative data. Finally, a visualization scheme was proposed that, when several measures are applied, enables an easy identification of high scoring motifs. Conclusion: In this work we have surveyed and categorized 14 significance measures for pattern evaluation. Their ability to rank three types of deterministic motifs was evaluated. Measures were applied in different testing conditions, where relations were identified. This study provides some pertinent insights on the choice of the right set of significance measures for the evaluation of deterministic motifs extracted from protein databases.

CloseRead Abstract