Publications

Publications by João Gama

2021

A new self-organizing map based algorithm for multi-label stream classification

Authors
Cerri, R; Costa Jínior, JD; Faria, ER; Gama, J;

Publication
SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, Republic of Korea, March 22-26, 2021

Abstract
Several algorithms have been proposed for offline multi-label classification. However, applications in areas such as traffic monitoring, social networks, and sensors produce data continuously, the so called data streams, posing challenges to batch multi-label learning. With the lack of stationarity in the distribution of data streams, new algorithms are needed to online adapt to such changes (concept drift). Also, in realistic applications, changes occur in scenarios with infinitely delayed labels, where the true classes of the arrival instances are never available. We propose an online unsupervised incremental method based on self-organizing maps for multi-label stream classification in scenarios with infinitely delayed labels. We consider the existence of an initial set of labeled instances to train a self-organizing map for each label. The learned models are then used and adapted in an evolving stream to classify new instances, considering that their classes will never be available. We adapt to incremental concept drifts by online updating the weight vectors of winner neurons and the dataset label cardinality. Predictions are obtained using the Bayes rule and the outputs of each neuron, adapting the prior probabilities and conditional probabilities of the classes in the stream. Experiments using synthetic and real datasets show that our method is highly competitive with several ones from the literature, in both stationary and concept drift scenarios. © 2021 ACM.

CloseRead Abstract

2022

Methods and tools for causal discovery and causal inference

Authors
Nogueira, AR; Pugnana, A; Ruggieri, S; Pedreschi, D; Gama, J;

Publication
WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Causality is a complex concept, which roots its developments across several fields, such as statistics, economics, epidemiology, computer science, and philosophy. In recent years, the study of causal relationships has become a crucial part of the Artificial Intelligence community, as causality can be a key tool for overcoming some limitations of correlation-based Machine Learning systems. Causality research can generally be divided into two main branches, that is, causal discovery and causal inference. The former focuses on obtaining causal knowledge directly from observational data. The latter aims to estimate the impact deriving from a change of a certain variable over an outcome of interest. This article aims at covering several methodologies that have been developed for both tasks. This survey does not only focus on theoretical aspects. But also provides a practical toolkit for interested researchers and practitioners, including software, datasets, and running examples. This article is categorized under: Algorithmic Development > Causality Discovery Fundamental Concepts of Data and Knowledge > Explainable AI Technologies > Machine Learning

CloseRead Abstract

2020

Profiling high leverage points for detecting anomalous users in telecom data networks

Authors
Tabassum, S; Azad, MA; Gama, J;

Publication
ANNALS OF TELECOMMUNICATIONS

Abstract
Fraud in telephony incurs huge revenue losses and causes a menace to both the service providers and legitimate users. This problem is growing alongside augmenting technologies. Yet, the works in this area are hindered by the availability of data and confidentiality of approaches. In this work, we deal with the problem of detecting different types of unsolicited users from spammers to fraudsters in a massive phone call network. Most of the malicious users in telecommunications have some of the characteristics in common. These characteristics can be defined by a set of features whose values are uncommon for normal users. We made use of graph-based metrics to detect profiles that are significantly far from the common user profiles in a real data log with millions of users. To achieve this, we looked for the high leverage points in the 99.99th percentile, which identified a substantial number of users as extreme anomalous points. Furthermore, clustering these points helped distinguish malicious users efficiently and minimized the problem space significantly. Convincingly, the learned profiles of these detected users coincided with fraudulent behaviors.

CloseRead Abstract

2022

Semi-causal decision trees

Authors
Nogueira, AR; Ferreira, CA; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE

Abstract
Typically, classification algorithms use correlation analysis to make decisions. However, these decisions and the models they learn are not easily understandable for the typical user. Causal discovery is the field that studies the means to find causal relationships in observational data. Although highly interpretable, causal discovery algorithms tend to not perform so well in classification problems. This paper aims to propose a hybrid decision tree approach (SC tree) that mixes causal discovery with correlation analysis through the implementation of a custom metric to split the data in the tree's construction (Semi-causal gain ratio). In the results, the proposed methodology obtained a significant performance improvement (11.26% mean error rate) when compared to several causal baselines CDT-PS (23.67% ) and CDT-SPS (25.14%), matching closely the performance of J48 (10.20%), used as a correlation baseline, in ten binary data sets. Besides, when compared with PC in discrete data sets, the proposed approach obtained substantial improvement (16.17% against 28.07% in terms of mean error rate).

CloseRead Abstract

2021

Tensor decomposition for analysing time-evolving social networks: an overview

Authors
Fernandes, S; Fanaee T, H; Gama, J;

Publication
ARTIFICIAL INTELLIGENCE REVIEW

Abstract
Social networks are becoming larger and more complex as new ways of collecting social interaction data arise (namely from online social networks, mobile devices sensors, ...). These networks are often large-scale and of high dimensionality. Therefore, dealing with such networks became a challenging task. An intuitive way to deal with this complexity is to resort to tensors. In this context, the application of tensor decomposition has proven its usefulness in modelling and mining these networks: it has not only been applied for exploratory analysis (thus allowing the discovery of interaction patterns), but also for more demanding and elaborated tasks such as community detection and link prediction. In this work, we provide an overview of the methods based on tensor decomposition for the purpose of analysing time-evolving social networks from various perspectives: from community detection, link prediction and anomaly/event detection to network summarization and visualization. In more detail, we discuss the ideas exploited to carry out each social network analysis task as well as its limitations in order to give a complete coverage of the topic.

CloseRead Abstract

2021

Using network features for credit scoring in microfinance

Authors
Paraiso, P; Ruiz, S; Gomes, P; Rodrigues, L; Gama, J;

Publication
INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS

Abstract
The usage of non-traditional data in credit scoring, from microfinance institutions, is very useful when trying to address the problem, very common in emerging markets, of the lack of a verifiable customers' credit history. In this context, this paper relies on data acquired from smartphones in a loan classification problem. We conduct a set of experiments concerning feature selection, strategies to deal with imbalanced datasets and algorithm choice, to define a baseline model. This model is, then, compared to others adding network features to the original ones. For that comparison, we generate a network that links a given user to its phone book contacts which are users of a given mobile application, taking into account the ethics and privacy concerns involved, and use some feature extraction techniques, such as the introduction of centrality measures and the definition of node embeddings, in order to capture certain aspects of the network's topology. Several node embedding algorithms are tested, but only Node2Vec proves to be significantly better than the baseline model, applying Friedman's post hoc tests. This node embedding algorithm outperforms all the other, representing a relative improvement, in comparison with the baseline model, of 5.74% on the mean accuracy, 7.13% on the area under the Receiver Operating Characteristic curve and 30.83% on the Kolmogorov-Smirnov statistic scores. This method, therefore, proves to be very promising when trying to discriminate between "good" and "bad" customers, in credit scoring classification problems.

CloseRead Abstract