2007
Authors
Spinosa, EJ; de Carvalho, APDF; Gama, J;
Publication
APPLIED COMPUTING 2007, VOL 1 AND 2
Abstract
A machine learning approach that is capable of treating data streams presents new challenges and enables the analysis of a variety of real problems in which concepts change over time. In this scenario, the ability to identify novel concepts as well as to deal with concept drift axe two important attributes. This paper presents a technique based on the k-means clustering algorithm aimed at considering those two situations in a single learning strategy. Experimental results performed with data from various domains provide insight into how clustering algorithms can be used for the discovery of new concepts in streams of data.
2008
Authors
Spinosa, EJ; de Carvalho, APDF; Gama, J;
Publication
APPLIED COMPUTING 2008, VOLS 1-3
Abstract
In this paper, a cluster-based novelty detection technique capable of dealing with a large amount of data is presented and evaluated in the context of intrusion detection. Starting with examples of a single class that describe the normal profile, the proposed technique detects novel concepts initially as cohesive clusters of examples and later as sets of clusters in an unsupervised incremental learning fashion. Experimental results with the KDD Cup 1999 data set show that the technique is capable of dealing with data streams, successfully learning novel concepts that are pure in terms of the real class structure.
2009
Authors
Castillo, G; Gama, J;
Publication
INTELLIGENT DATA ANALYSIS
Abstract
This paper is concerned with adaptive learning algorithms for Bayesian network classifiers in a prequential (on-line) learning scenario. In this scenario, new data is available over time. An efficient supervised learning algorithm must be able to improve its predictive accuracy by incorporating the incoming data, while optimizing the cost of updating. However, if the process is not strictly stationary, the target concept could change over time. Hence, the predictive model should be adapted quickly to these changes. The main contribution of this work is a proposal of an unified, adaptive prequential framework for supervised learning called AdPreqFr4SL, which attempts to handle the cost-performance trade-off and deal with concept drift. Starting with the simple Naive Bayes, we scale up the complexity by gradually increasing the maximum number of allowable attribute dependencies, and then by searching for new dependences in the extended search space. Since updating the structure is a costly task, we use new data to primarily adapt the parameters. We adapt the structure only when is actually necessary. The method for handling concept drift is based on the Shewhart P-Chart. We experimentally prove the advantages of using the AdPreqFr4SL in comparison with its non-adaptive versions.
2009
Authors
Gama, J; Ganguly, A; Omitaomu, O; Vatsavai, R; Gaber, M;
Publication
INTELLIGENT DATA ANALYSIS
Abstract
2011
Authors
Carmona Cejudo, JM; Baena Garcia, M; del Campo Avila, J; Bifet, A; Gama, J; Morales Bueno, R;
Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS X: IDA 2011
Abstract
Real-time email classification is a challenging task because of its online nature, subject to concept-drift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for data streams evaluation. Therefore, other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using mechanisms such as fading factors. In this paper we present GNUsmail, an open-source extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different online mining methods, using state-of-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, including a tool for launching replicable experiments.
2012
Authors
Oliveira, M; Gama, J;
Publication
INTELLIGENT DATA ANALYSIS
Abstract
The study of evolution has become an important research issue, especially in the last decade, due to our ability to collect and store high detailed and time-stamped data. The need for describing and understanding the behavior of a given phenomena over time led to the emergence of new frameworks and methods focused on the temporal evolution of data and models. In this paper we address the problem of monitoring the evolution of clusters over time and propose the MEC framework. MEC traces evolution through the detection and categorization of clusters transitions, such as births, deaths and merges, and enables their visualization through bipartite graphs. It includes a taxonomy of transitions, a tracking method based in the computation of conditional probabilities, and a transition detection algorithm. We use MEC with two main goals: to determine the general evolution trends and to detect abnormal behavior or rare events. To demonstrate the applicability of our framework we present real world economic and financial case studies, using datasets extracted from Banco de Portugal Central Balance-Sheet Database and the The Data Page of New York University -Leonard N. Stern School of Business. The results allow us to draw interesting conclusions about the evolution of activity sectors and European companies.
The access to the final selection minute is only available to applicants.
Please check the confirmation e-mail of your application to obtain the access code.