Publications

Publications by João Gama

2008

Introduction

Authors
Ganguly, AR; Gama, J; Omitaomu, OA; Gaber, MM; Vatsavai, RR;

Publication
Knowledge Discovery from Sensor Data

Abstract

2007

OLINDDA: A cluster-based approach for detecting novelty and concept drift in data streams

Authors
Spinosa, EJ; de Carvalho, APDF; Gama, J;

Publication
APPLIED COMPUTING 2007, VOL 1 AND 2

Abstract
A machine learning approach that is capable of treating data streams presents new challenges and enables the analysis of a variety of real problems in which concepts change over time. In this scenario, the ability to identify novel concepts as well as to deal with concept drift are two important attributes. This paper presents a technique based on the k-means clustering algorithm aimed at considering those two situations in a single learning strategy. Experimental results performed with data from various domains provide insight into how clustering algorithms can be used for the discovery of new concepts in streams of data.


2008

Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks

Authors
Spinosa, EJ; de Carvalho, APDF; Gama, J;

Publication
APPLIED COMPUTING 2008, VOLS 1-3

Abstract
In this paper, a cluster-based novelty detection technique capable of dealing with a large amount of data is presented and evaluated in the context of intrusion detection. Starting with examples of a single class that describe the normal profile, the proposed technique detects novel concepts initially as cohesive clusters of examples and later as sets of clusters in an unsupervised incremental learning fashion. Experimental results with the KDD Cup 1999 data set show that the technique is capable of dealing with data streams, successfully learning novel concepts that are pure in terms of the real class structure.


2009

Adaptive Bayesian network classifiers

Authors
Castillo, G; Gama, J;

Publication
INTELLIGENT DATA ANALYSIS

Abstract
This paper is concerned with adaptive learning algorithms for Bayesian network classifiers in a prequential (on-line) learning scenario. In this scenario, new data becomes available over time. An efficient supervised learning algorithm must be able to improve its predictive accuracy by incorporating the incoming data, while optimizing the cost of updating. However, if the process is not strictly stationary, the target concept could change over time. Hence, the predictive model should be adapted quickly to these changes. The main contribution of this work is a proposal of a unified, adaptive prequential framework for supervised learning called AdPreqFr4SL, which attempts to handle the cost-performance trade-off and deal with concept drift. Starting with the simple Naive Bayes, we scale up the complexity by gradually increasing the maximum number of allowable attribute dependencies, and then by searching for new dependencies in the extended search space. Since updating the structure is a costly task, we use new data to primarily adapt the parameters. We adapt the structure only when it is actually necessary. The method for handling concept drift is based on the Shewhart P-Chart. We experimentally prove the advantages of using the AdPreqFr4SL in comparison with its non-adaptive versions.
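
For readers unfamiliar with the Shewhart P-Chart mentioned in this abstract, the following is a minimal sketch of the generic textbook chart applied to batch error rates. It is not the AdPreqFr4SL implementation: the function name, the 2-sigma warning line, and the 3-sigma control line are assumptions chosen for illustration.

```python
import math


def p_chart_status(batch_error_rate, baseline_error_rate, batch_size):
    """Classify a batch error rate against upper P-chart limits.

    Hypothetical helper: the 2-sigma warning and 3-sigma control limits
    are the usual textbook choices, assumed here for illustration.
    """
    # Standard deviation of a proportion estimated from batch_size examples.
    sigma = math.sqrt(
        baseline_error_rate * (1.0 - baseline_error_rate) / batch_size
    )
    if batch_error_rate > baseline_error_rate + 3.0 * sigma:
        return "drift"
    if batch_error_rate > baseline_error_rate + 2.0 * sigma:
        return "warning"
    return "in control"


if __name__ == "__main__":
    # Baseline error of 10% estimated earlier; batches of 100 examples.
    for rate in (0.12, 0.17, 0.25):
        print(rate, p_chart_status(rate, 0.10, 100))
```

Only the upper limits are checked in this sketch, since a rising error rate is the signal of interest when monitoring a classifier.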


2009

Knowledge discovery from data streams Introduction

Authors
Gama, J; Ganguly, A; Omitaomu, O; Vatsavai, R; Gaber, M;

Publication
INTELLIGENT DATA ANALYSIS

Abstract

2011

Online Evaluation of Email Streaming Classifiers Using GNUsmail

Authors
Carmona Cejudo, JM; Baena Garcia, M; del Campo Avila, J; Bifet, A; Gama, J; Morales Bueno, R;

Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS X: IDA 2011

Abstract
Real-time email classification is a challenging task because of its online nature, subject to concept drift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for data stream evaluation. Therefore, other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using mechanisms such as fading factors. In this paper we present GNUsmail, an open-source extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different online mining methods, using state-of-the-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, including a tool for launching replicable experiments.
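
As an illustration of the fading-factor idea named in this abstract, here is a minimal sketch of a prequential error computed with the common recursive formulation, in which older losses are discounted by a factor alpha at every step. It is not GNUsmail code: the function name and the default value of alpha are assumptions.

```python
def faded_prequential_error(losses, alpha=0.995):
    """Return the faded prequential error after each example.

    losses: per-example losses observed in test-then-train order
            (for instance 0 for a correct prediction, 1 for a mistake).
    alpha:  fading factor in (0, 1]; alpha = 1 gives the plain running mean.
    The name and the default alpha are illustrative assumptions.
    """
    faded_sum = 0.0    # discounted sum of losses
    faded_count = 0.0  # discounted number of examples
    errors = []
    for loss in losses:
        faded_sum = loss + alpha * faded_sum
        faded_count = 1.0 + alpha * faded_count
        errors.append(faded_sum / faded_count)
    return errors


if __name__ == "__main__":
    stream = [1, 0, 1, 0, 0, 0]
    print(faded_prequential_error(stream, alpha=0.9))
```

Smaller values of alpha forget old mistakes faster, so the estimate tracks recent performance more closely at the price of higher variance.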
