Publications

Publications by João Gama

2019

Pruned Sets for Multi-Label Stream Classification without True Labels

Authors
Costa, JD; Faria, ER; Silva, JA; Gama, J; Cerri, R;

Publication
2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)

Abstract
In multi-label classification problems, an example can be simultaneously classified into more than one class. This is also a challenging task in Data Streams (DS) classification, where unbounded and non-stationary distributed multi-label data contain multiple concepts that drift at different rates and in different patterns. In addition, the true labels of the examples may never become available, so updating classification models in a supervised fashion is infeasible. In this paper, we propose a Multi-Label Stream Classification (MLSC) method that applies a Novelty Detection (ND) procedure to update the classification model, detecting any new patterns in the examples that differ in some aspects from observed patterns, in an unsupervised fashion and without any external feedback. Although ND is suitable for multi-class stream classification, it is still not well investigated for multi-label problems. We improve an initial work proposed in [1] and extend it with a new Pruned Sets (PS) transformation strategy. The experiments show that our method presents competitive performance over data sets with different concept drifts and outperforms, in some aspects, the baseline methods.


2020

A drift detection method based on dynamic classifier selection

Authors
Pinagé, F; dos Santos, EM; Gama, J;

Publication
DATA MINING AND KNOWLEDGE DISCOVERY

Abstract
Machine learning algorithms can be applied to several practical problems, such as spam, fraud and intrusion detection, and customer preferences, among others. In most of these problems, data come in streams, which means that the data distribution may change over time, leading to concept drift. The literature is abundant in supervised methods based on error monitoring for explicit drift detection. However, these methods may become infeasible in some real-world applications where there is no fully labeled data available, and may depend on a significant decrease in accuracy to be able to detect drifts. There are also methods based on blind approaches, where the decision model is updated constantly. However, this may lead to unnecessary system updates. In order to overcome these drawbacks, we propose in this paper a semi-supervised drift detector that uses an ensemble of classifiers based on self-training online learning and dynamic classifier selection. For each unknown sample, a dynamic selection strategy is used to choose, among the ensemble's component members, the classifier most likely to be the correct one for classifying it. The prediction assigned by the chosen classifier is used to compute an estimate of the error produced by the ensemble members. The proposed method monitors this pseudo-error in order to detect drifts and to update the decision model only after drift detection. The achievement of this method is relevant in that it allows drift detection and reaction and is applicable in several practical problems. The experiments conducted indicate that the proposed method attains high performance and detection rates, while reducing the amount of labeled data used to detect drift.


2020

Spatiotemporal Traffic Anomaly Detection on Urban Road Network Using Tensor Decomposition Method

Authors
Tisljaric, L; Fernandes, S; Caric, T; Gama, J;

Publication
DS

Abstract
Tensor-based models emerged only recently in modeling and analysis of the spatiotemporal road traffic data. They outperform other data models regarding the property of simultaneously capturing both spatial and temporal components of the observed traffic dataset. In this paper, the nonnegative tensor decomposition method is used to extract traffic patterns in the form of Speed Transition Matrix (STM). The STM is presented as the approach for modeling the large sparse Floating Car Data (FCD). The anomaly of the traffic pattern is estimated using Kullback–Leibler divergence between the observed traffic pattern and the average traffic pattern. Experiments were conducted on the large sparse FCD dataset for the most relevant road segments in the City of Zagreb, which is the capital and largest city in Croatia. Results show that the method was able to detect the most anomalous traffic road segments, and with analysis of the extracted spatial and temporal components, conclusions could be drawn about the causes of the anomalies. Results are validated by using the domain knowledge from the Highway Capacity Manual and achieved a precision score value of more than 90%. Therefore, such valuable traffic information can be used in routing applications and urban traffic planning.
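
As a rough illustration of the anomaly score described in this abstract, the sketch below computes the Kullback–Leibler divergence between an observed pattern and an average pattern. The function name `kl_divergence`, the `eps` guard, the normalization step, and the toy 2x2 matrices are assumptions made for this example; they are not taken from the paper, and the tensor decomposition step is omitted.

```python
import math

def kl_divergence(observed, average, eps=1e-12):
    """KL(observed || average) for two nonnegative matrices (nested lists).

    Both matrices are normalized to sum to 1 before comparison.
    `eps` guards against division by zero and is an assumption of this sketch.
    """
    flat_p = [v for row in observed for v in row]
    flat_q = [v for row in average for v in row]
    sum_p, sum_q = sum(flat_p), sum(flat_q)
    total = 0.0
    for a, b in zip(flat_p, flat_q):
        a, b = a / sum_p, b / sum_q
        if a > 0:  # terms with zero observed mass contribute nothing
            total += a * math.log(a / max(b, eps))
    return total

# Toy speed-transition patterns (hypothetical values).
average_pattern = [[0.7, 0.1], [0.1, 0.1]]
typical_pattern = [[0.7, 0.1], [0.1, 0.1]]
congested_pattern = [[0.1, 0.1], [0.1, 0.7]]

score_typical = kl_divergence(typical_pattern, average_pattern)
score_congested = kl_divergence(congested_pattern, average_pattern)
```

A pattern identical to the average scores zero, while a pattern whose mass sits in a different cell scores higher, which is the sense in which a larger divergence indicates a more anomalous segment.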


2016

Online Bagging for Recommendation with Incremental Matrix Factorization

Authors
Vinagre, J; Jorge, AM; Gama, J;

Publication
STREAMEVOLV@ECML-PKDD

Abstract
Online recommender systems often deal with continuous, potentially fast and unbounded flows of data. Ensemble methods for recommender systems have been used in the past in batch algorithms, however they have never been studied with incremental algorithms, that are capable of processing those data streams on the fly. We propose online bagging, using an incremental matrix factorization algorithm for positive-only data streams. Using prequential evaluation, we show that bagging is able to improve accuracy more than 20% over the baseline with small computational overhead.
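
The sketch below illustrates the general idea of online bagging in its commonly used form, where each base model is updated k times per incoming example with k drawn from a Poisson(1) distribution. Whether the paper uses exactly this scheme is an assumption here. `RunningMean` is a hypothetical toy learner standing in for incremental matrix factorization, and all class and function names are invented for the example.

```python
import math
import random

def poisson1(rng):
    """Draw from Poisson(lambda=1) using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

class RunningMean:
    """Toy incremental learner (hypothetical stand-in for the base model)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n

    def predict(self):
        return self.mean

class OnlineBagging:
    """Each example is shown to each base model k ~ Poisson(1) times."""
    def __init__(self, n_models, seed=0):
        self.rng = random.Random(seed)
        self.models = [RunningMean() for _ in range(n_models)]

    def update(self, y):
        for model in self.models:
            for _ in range(poisson1(self.rng)):
                model.update(y)

    def predict(self):
        # Aggregate by averaging the base predictions.
        return sum(m.predict() for m in self.models) / len(self.models)

ensemble = OnlineBagging(n_models=10, seed=42)
for _ in range(200):
    ensemble.update(1.0)
prediction = ensemble.predict()
```

Drawing the number of updates from Poisson(1) approximates bootstrap resampling without storing the stream, which is why it suits incremental learners.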


2020

NORMO: A new method for estimating the number of components in CP tensor decomposition

Authors
Fernandes, S; Fanaee T, H; Gama, J;

Publication
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE

Abstract
Tensor decompositions are multi-way analysis tools which have been successfully applied in a wide range of different fields. However, there are still challenges that remain little explored, namely the following: when applying tensor decomposition techniques, what should we expect from the result? How can we evaluate its quality? It is expected that, when the number of components is suitable, little redundancy is observed in the decomposition result. Based on this assumption, we propose a new method, NORMO, which aims at estimating the number of components in CANDECOMP/PARAFAC (CP) decomposition so that no redundancy is observed in the result. To the best of our knowledge, this work is the first attempt to tackle such a problem. According to our experiments, the number of non-redundant components estimated by NORMO is among the most accurate estimates of the true CP number of components in both synthetic and real-world tensor datasets (thus validating the rationale guiding our method). Moreover, NORMO is more efficient than most of its competitors. Additionally, our method can be used to discover multiple levels of granularity in the patterns discovered.


2020

Assembled Feature Selection for Credit Scoring in Microfinance with Non-traditional Features

Authors
Ruiz, S; Gomes, P; Rodrigues, L; Gama, J;

Publication
DS

Abstract
Since early 2000, Microfinance Institutions (MFI) have been using credit scoring for their risk assessment. However, one of the main problems of credit scoring in microfinance is the lack of structured financial data. To address this problem, MFI have started using non-traditional data, which can be extracted from the digital footprint of their users. The non-traditional data can be used to build algorithms that can identify good borrowers, as in traditional banking. This paper proposes an assembled method to evaluate the predictive power of the non-traditional method. By using the Weight of Evidence (WoE), a transformation based on the distribution within the feature, as the feature transformation method, and then applying extremely randomized trees for feature selection, we were able to improve the accuracy of the credit scoring model by 20.20% when compared to the credit scoring model built with the traditional implementation of WoE. This paper shows how assembling WoE with different feature selection criteria can result in more robust credit scoring models in microfinance.
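
For readers unfamiliar with Weight of Evidence, the sketch below shows one common textbook form of the transformation: for each bin of a feature, the natural log of the share of good borrowers divided by the share of bad borrowers. The sign convention, the function name `weight_of_evidence`, and the toy counts are assumptions of this example, and the feature selection step with extremely randomized trees is not shown.

```python
import math

def weight_of_evidence(bins):
    """Compute WoE per bin.

    bins: dict mapping bin label -> (n_good, n_bad).
    WoE = ln(share of goods in bin / share of bads in bin).
    Bins with a zero count are rejected rather than smoothed (a simplification).
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    for label, (good, bad) in bins.items():
        if good == 0 or bad == 0:
            raise ValueError(f"bin {label!r} has a zero count; merge or smooth it")
        woe[label] = math.log((good / total_good) / (bad / total_bad))
    return woe

# Hypothetical counts of good and bad borrowers per bin of one feature.
counts = {"low_activity": (20, 80), "high_activity": (80, 20)}
woe = weight_of_evidence(counts)
```

Under this convention a positive value marks a bin where good borrowers are over-represented, and a negative value marks the opposite.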
