Publications

Publications by João Gama

2017

An evolutionary algorithm for clustering data streams with a variable number of clusters

Authors
Silva, JD; Hruschka, ER; Gama, J;

Publication
EXPERT SYSTEMS WITH APPLICATIONS

Abstract
Several algorithms for clustering data streams based on k-Means have been proposed in the literature. However, most of them assume that the number of clusters, k, is known a priori by the user and can be kept fixed throughout the data analysis process. Besides the difficulty in choosing k, data stream clustering poses several other challenges, such as handling non-stationary, unbounded data that arrive in an online fashion. In this paper, we propose a Fast Evolutionary Algorithm for Clustering data streams (FEAC-Stream) that allows estimating k automatically from data in an online fashion. FEAC-Stream uses the Page-Hinkley Test to detect possible degradation in the quality of the induced clusters, thereby triggering an evolutionary algorithm that re-estimates k accordingly. FEAC-Stream relies on the assumption that clusters of (partially unknown) data can provide useful information about the dynamics of the data stream. We illustrate the potential of FEAC-Stream in a set of experiments using both synthetic and real-world data streams, comparing it to four related algorithms, namely: CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk and StreamKM++-BkM. The obtained results show that FEAC-Stream provides good data partitions and that it can detect, and accordingly react to, data changes.
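
For illustration only, the Page-Hinkley test mentioned in the abstract can be sketched as follows. This is a generic textbook form of the test, not code from the paper: the class name, the parameter names `delta` and `threshold`, their default values, and the helper `first_alarm` are all assumptions made for this example. The monitored value is taken to be any numeric stream in which a sustained increase in the mean signals degradation.

```python
class PageHinkley:
    """Generic Page-Hinkley test for an upward shift in the mean of a stream.

    delta and threshold are illustrative defaults, not values from the paper.
    """

    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level (often written lambda)
        self.reset()

    def reset(self):
        self.n = 0          # number of observations seen
        self.mean = 0.0     # running mean of the observations
        self.cum = 0.0      # cumulative deviation m_t
        self.min_cum = 0.0  # smallest m_t seen so far

    def update(self, x):
        """Consume one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.threshold


def first_alarm(stream, **kwargs):
    """Return the index of the first alarm in stream, or None if there is none."""
    detector = PageHinkley(**kwargs)
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None
```

In a design like the one the abstract describes, an alarm would be the point at which the number of clusters is re-estimated, after which the detector would typically be reset.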


2018

Proceedings of the Workshop on Large-scale Learning from Data Streams in Evolving Environments (STREAMEVOLV 2016) co-located with the 2016 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2016), Riva del Garda, Italy, September 23, 2016

Authors
Mouchaweh, MS; Bouchachia, H; Gama, J; Ribeiro, RP;

Publication
STREAMEVOLV@ECML-PKDD


2015

Data Stream Classification Based on the Gamma Classifier

Authors
Valeria Uriarte Arcia, AV; Lopez Yanez, I; Yanez Marquez, C; Gama, J; Camacho Nieto, O;

Publication
MATHEMATICAL PROBLEMS IN ENGINEERING

Abstract
The ever-increasing generation of data confronts us with the problem of handling massive amounts of information online. One of the biggest challenges is how to extract valuable information from these massive continuous data streams in a single scan. In a data stream context, data arrive continuously at high speed; therefore the algorithms developed to address this context must be efficient regarding memory and time management and capable of detecting changes over time in the underlying distribution that generated the data. This work describes a novel method for the task of pattern classification over a continuous data stream based on an associative model. The proposed method is based on the Gamma classifier, which is inspired by the Alpha-Beta associative memories; both are supervised pattern recognition models. The proposed method is capable of handling the space and time constraints inherent to data stream scenarios. The Data Streaming Gamma classifier (DS-Gamma classifier) implements a sliding window approach to provide concept drift detection and a forgetting mechanism. In order to test the classifier, several experiments were performed using different data stream scenarios with real and synthetic data streams. The experimental results show that the method exhibits competitive performance when compared to other state-of-the-art algorithms.


2015

Concept Drift Detection with Clustering via Statistical Change Detection Methods

Authors
Sakamoto, Y; Fukui, K; Gama, J; Nicklas, D; Moriyama, K; Numao, M;

Publication
2015 SEVENTH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE)

Abstract
We propose a concept drift detection method utilizing statistical change detection in which a drift detection method and the Page-Hinkley test are employed. Our method enables users to annotate clustering results without constructing a model of drift detection for every input. In our experiments using synthetic data, we evaluated our proposed method on the basis of detection delay and false detection, and also revealed relations between the degree of drift and the parameters of the method.


2018

Multi-label classification from high-speed data streams with adaptive model rules and random rules

Authors
Sousa, R; Gama, J;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE

Abstract
Multi-label classification is a methodology that tries to solve classification problems where multiple classes are associated with each data example. Data streams pose new challenges to this methodology because of the massive amounts of structured data produced. In fact, most existing batch-mode methods may not support this condition. Therefore, this paper proposes two multi-label classification methods based on rule and ensemble learning from a continuous flow of data. These methods are derived from a multi-target regression algorithm. The main contribution of this work is the rule specialization for subsets of class labels, instead of the usual local (individual models for each output) or global (one model for all outputs) methods. Prequential evaluation was conducted where global, local and subset operation modes were compared against other online classifiers found in the literature. Six real-world data sets were used. The evaluation demonstrated that the subset specialization presents competitive performance when compared to local and global approaches and online classifiers found in the literature.
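
As a rough illustration of the prequential (test-then-train) protocol named in the abstract, the sketch below scores each example before the model learns from it. It is a simplified single-label version, not the paper's multi-label setup: the `MajorityClass` baseline, its `predict` and `learn` methods, and the function `prequential_accuracy` are hypothetical names introduced for this example.

```python
from collections import Counter


class MajorityClass:
    """Toy online learner (an assumption for this sketch): predicts the most frequent label."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # No prediction is possible before the first label has been seen.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(model, stream):
    """Test-then-train: each (x, y) pair is first predicted, then used for training."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test on the example first
            correct += 1
        model.learn(x, y)           # then train on the same example
        total += 1
    return correct / total if total else 0.0
```

Because every example is used for testing before training, no separate hold-out set is needed, which suits unbounded streams.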


2015

Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency

Authors
de Souza, VMA; Silva, DF; Gama, J; Batista, GEAPA;

Publication
SDM

Abstract
Data stream classification algorithms for nonstationary environments frequently assume the availability of class labels, instantly or with some lag after the classification. However, certain applications, mainly those related to sensors and robotics, involve high costs to obtain new labels during the classification phase. Such a scenario, in which the actual labels of processed data are never available, is called extreme verification latency. Extreme verification latency requires new classification methods capable of adapting to possible changes over time without external supervision. This paper presents a fast, simple, intuitive and accurate algorithm to classify nonstationary data streams in an extreme verification latency scenario, namely Stream Classification Algorithm Guided by Clustering - SCARGC. Our method consists of a clustering step followed by a classification step, applied repeatedly in a closed-loop fashion. We show in several classification tasks evaluated on synthetic and real data that our method is faster and more accurate than the state of the art.
