Publications

Publications by João Gama

2020

Interconnect bypass fraud detection: a case study

Authors
Veloso, B; Tabassum, S; Martins, C; Espanha, R; Azevedo, R; Gama, J;

Publication
ANNALS OF TELECOMMUNICATIONS

Abstract
The high asymmetry of international termination rates is fertile ground for the appearance of fraud in telecom companies. International calls have higher values when compared with national ones, which raises the attention of fraudsters. In this paper, we present a solution for a real problem called interconnect bypass fraud, more specifically, a newly identified distributed pattern that crosses different countries and keeps fraudsters from being tracked by almost all fraud detection techniques. This problem is one of the most expressive in the telecommunication domain, and it has some abnormal behaviours like the occurrence of a burst of calls from specific numbers. Based on this assumption, we propose the adoption of a new fast forgetting technique that works together with the Lossy Counting algorithm. We apply frequent set mining to capture distributed patterns from different countries. Our goal is to detect as soon as possible items with abnormal behaviours, e.g., bursts of calls, repetitions, mirrors, distributed behaviours and a small number of calls spread by a vast set of destination numbers. The results show that the application of different techniques improves the detection ratio and not only complements the techniques used by the telecom company but also improves the performance of the Lossy Counting algorithm in terms of run-time, memory used and sensibility to detect the abnormal behaviours. Additionally, the application of frequent set mining allows us to capture distributed fraud patterns.

CloseRead Abstract

2021

A new self-organizing map based algorithm for multi-label stream classification

Authors
Cerri, R; Costa Júnior, JD; Faria, ER; Gama, J;

Publication
SAC

Abstract
Several algorithms have been proposed for offline multi-label classification. However, applications in areas such as traffic monitoring, social networks, and sensors produce data continuously, the so called data streams, posing challenges to batch multi-label learning. With the lack of stationarity in the distribution of data streams, new algorithms are needed to online adapt to such changes (concept drift). Also, in realistic applications, changes occur in scenarios with infinitely delayed labels, where the true classes of the arrival instances are never available. We propose an online unsupervised incremental method based on self-organizing maps for multi-label stream classification in scenarios with infinitely delayed labels. We consider the existence of an initial set of labeled instances to train a self-organizing map for each label. The learned models are then used and adapted in an evolving stream to classify new instances, considering that their classes will never be available. We adapt to incremental concept drifts by online updating the weight vectors of winner neurons and the dataset label cardinality. Predictions are obtained using the Bayes rule and the outputs of each neuron, adapting the prior probabilities and conditional probabilities of the classes in the stream. Experiments using synthetic and real datasets show that our method is highly competitive with several ones from the literature, in both stationary and concept drift scenarios.

CloseRead Abstract

2022

Methods and tools for causal discovery and causal inference

Authors
Nogueira, AR; Pugnana, A; Ruggieri, S; Pedreschi, D; Gama, J;

Publication
WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Causality is a complex concept, which roots its developments across several fields, such as statistics, economics, epidemiology, computer science, and philosophy. In recent years, the study of causal relationships has become a crucial part of the Artificial Intelligence community, as causality can be a key tool for overcoming some limitations of correlation-based Machine Learning systems. Causality research can generally be divided into two main branches, that is, causal discovery and causal inference. The former focuses on obtaining causal knowledge directly from observational data. The latter aims to estimate the impact deriving from a change of a certain variable over an outcome of interest. This article aims at covering several methodologies that have been developed for both tasks. This survey does not only focus on theoretical aspects. But also provides a practical toolkit for interested researchers and practitioners, including software, datasets, and running examples. This article is categorized under: Algorithmic Development > Causality Discovery Fundamental Concepts of Data and Knowledge > Explainable AI Technologies > Machine Learning

CloseRead Abstract

2020

Profiling high leverage points for detecting anomalous users in telecom data networks

Authors
Tabassum, S; Azad, MA; Gama, J;

Publication
ANNALS OF TELECOMMUNICATIONS

Abstract
Fraud in telephony incurs huge revenue losses and causes a menace to both the service providers and legitimate users. This problem is growing alongside augmenting technologies. Yet, the works in this area are hindered by the availability of data and confidentiality of approaches. In this work, we deal with the problem of detecting different types of unsolicited users from spammers to fraudsters in a massive phone call network. Most of the malicious users in telecommunications have some of the characteristics in common. These characteristics can be defined by a set of features whose values are uncommon for normal users. We made use of graph-based metrics to detect profiles that are significantly far from the common user profiles in a real data log with millions of users. To achieve this, we looked for the high leverage points in the 99.99th percentile, which identified a substantial number of users as extreme anomalous points. Furthermore, clustering these points helped distinguish malicious users efficiently and minimized the problem space significantly. Convincingly, the learned profiles of these detected users coincided with fraudulent behaviors.

CloseRead Abstract

2022

Semi-causal decision trees

Authors
Nogueira, AR; Ferreira, CA; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE

Abstract
Typically, classification algorithms use correlation analysis to make decisions. However, these decisions and the models they learn are not easily understandable for the typical user. Causal discovery is the field that studies the means to find causal relationships in observational data. Although highly interpretable, causal discovery algorithms tend to not perform so well in classification problems. This paper aims to propose a hybrid decision tree approach (SC tree) that mixes causal discovery with correlation analysis through the implementation of a custom metric to split the data in the tree's construction (Semi-causal gain ratio). In the results, the proposed methodology obtained a significant performance improvement (11.26% mean error rate) when compared to several causal baselines CDT-PS (23.67% ) and CDT-SPS (25.14%), matching closely the performance of J48 (10.20%), used as a correlation baseline, in ten binary data sets. Besides, when compared with PC in discrete data sets, the proposed approach obtained substantial improvement (16.17% against 28.07% in terms of mean error rate).

CloseRead Abstract

2021

Tensor decomposition for analysing time-evolving social networks: an overview

Authors
Fernandes, S; Fanaee T, H; Gama, J;

Publication
ARTIFICIAL INTELLIGENCE REVIEW

Abstract
Social networks are becoming larger and more complex as new ways of collecting social interaction data arise (namely from online social networks, mobile devices sensors, ...). These networks are often large-scale and of high dimensionality. Therefore, dealing with such networks became a challenging task. An intuitive way to deal with this complexity is to resort to tensors. In this context, the application of tensor decomposition has proven its usefulness in modelling and mining these networks: it has not only been applied for exploratory analysis (thus allowing the discovery of interaction patterns), but also for more demanding and elaborated tasks such as community detection and link prediction. In this work, we provide an overview of the methods based on tensor decomposition for the purpose of analysing time-evolving social networks from various perspectives: from community detection, link prediction and anomaly/event detection to network summarization and visualization. In more detail, we discuss the ideas exploited to carry out each social network analysis task as well as its limitations in order to give a complete coverage of the topic.

CloseRead Abstract