2004
Authors
Gama, J; Medas, P; Rocha, R;
Publication
Proceedings of the ACM Symposium on Applied Computing
Abstract
This paper presents an hybrid adaptive system for induction of forest of trees from data streams. The Ultra Fast Forest Tree system (UFFT) is an incremental algorithm, with constant time for processing each example, works online, and uses the Hoeffding bound to decide when to install a splitting test in a leaf leading to a decision node. Our system has been designed for continuous data. It uses analytical techniques to choose the splitting criteria, and the information gain to estimate the merit of each possible splitting-test. The number of examples required to evaluate the splitting criteria is sound, based on the Hoeffding bound. For multiclass problems,the algorithm builds a binary tree for each possible pair of classes, leading to a forest of trees. During the training phase the algorithm maintains a short term memory. Given a data stream, a fixed number of the most recent examples are maintained in a data-structure that supports constant time insertion and deletion. When a test is installed, a leaf is transformed into a decision node with two descendant leaves. The sufficient statistics of these leaves are initialized with the examples in the short term memory that will fall at these leaves. We study the behavior of UFFT in different problems. The experimental results shows that UFFT is competitive against a batch decision tree learner in large and medium datasets.
2004
Authors
Kalousis, A; Gama, J; Hilario, M;
Publication
MACHINE LEARNING
Abstract
In this paper we address two symmetrical issues, the discovery of similarities among classification algorithms, and among datasets. Both on the basis of error measures, which we use to define the error correlation between two algorithms, and determine the relative performance of a list of algorithms. We use the first to discover similarities between learners, and both of them to discover similarities between datasets. The latter sketch maps on the dataset space. Regions within each map exhibit specific patterns of error correlation or relative performance. To acquire an understanding of the factors determining these regions we describe them using simple characteristics of the datasets. Descriptions of each region are given in terms of the distributions of dataset characteristics within it.
2004
Authors
Gama, J; Medas, P;
Publication
Pattern Recognition in Information Systems, Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems, PRIS 2004, In conjunction with ICEIS 2004, Porto, Portugal, April 2004
Abstract
2004
Authors
Kubat, M; Gama, J; Utgoff, P;
Publication
Intelligent Data Analysis
Abstract
2004
Authors
Fonseca, N; Costa, VS; Silva, F; Camacho, R;
Publication
INDUCTIVE LOGIC PROGRAMMING, PROCEEDINGS
Abstract
ILP systems induce first-order clausal theories performing a search through very large hypotheses spaces containing redundant hypotheses. The generation of redundant hypotheses may prevent the systems from finding good models and increases the time to induce them. In this paper we propose a classification of hypotheses redundancy and show how expert knowledge can be provided to an ILP system to avoid it. Experimental results show that the number of hypotheses generated and execution time are reduced when expert knowledge is used to avoid redundancy.
2004
Authors
Giraud Carrier, C; Vilalta, R; Brazdil, P;
Publication
MACHINE LEARNING
Abstract
The access to the final selection minute is only available to applicants.
Please check the confirmation e-mail of your application to obtain the access code.