Publications

Publications by LIAAD

2008

A review on the combination of binary classifiers in multiclass problems

Authors
Lorena, AC; de Carvalho, ACPLF; Gama, JMP;

Publication
ARTIFICIAL INTELLIGENCE REVIEW

Abstract
Several real problems involve the classification of data into categories or classes. Given a data set containing data whose classes are known, Machine Learning algorithms can be employed for the induction of a classifier able to predict the class of new data from the same domain, performing the desired discrimination. Some learning techniques are originally conceived for the solution of problems with only two classes, also named binary classification problems. However, many problems require the discrimination of examples into more than two categories or classes. This paper presents a survey on the main strategies for the generalization of binary classifiers to problems with more than two classes, known as multiclass classification problems. The focus is on strategies that decompose the original multiclass problem into multiple binary subtasks, whose outputs are combined to obtain the final prediction.
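
The decomposition idea in this abstract can be illustrated with a minimal sketch of one strategy of this kind, one-vs-rest: one binary problem per class, with the binary outputs combined by taking the highest score. The toy centroid-based binary learner and all function names below are assumptions made for illustration; they are not taken from the paper.

```python
def train_binary_scorer(points, labels):
    """Toy binary learner (an assumption, not from the paper).

    Returns a function whose score is larger the closer a point lies to the
    positive-class centroid relative to the negative-class centroid.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    c_pos = centroid([p for p, y in zip(points, labels) if y == 1])
    c_neg = centroid([p for p, y in zip(points, labels) if y == 0])
    return lambda p: dist2(p, c_neg) - dist2(p, c_pos)


def train_one_vs_rest(points, labels):
    """Decompose a multiclass problem into one binary subtask per class."""
    scorers = {}
    for c in sorted(set(labels)):
        binary_labels = [1 if y == c else 0 for y in labels]
        scorers[c] = train_binary_scorer(points, binary_labels)
    return scorers


def predict_one_vs_rest(scorers, point):
    """Combine the binary outputs: predict the class with the highest score."""
    return max(scorers, key=lambda c: scorers[c](point))


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labels = ["a", "a", "b", "b", "c", "c"]
scorers = train_one_vs_rest(points, labels)
print(predict_one_vs_rest(scorers, (0.2, 0.5)))
```

Other decompositions split the classes differently (for example, one binary problem per pair of classes) and combine the outputs by voting rather than by maximum score.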


2008

Learning from Data Streams: Synopsis and Change Detection

Authors
Sebastiao, R; Gama, J; Mendonca, T;

Publication
STAIRS 2008

Abstract
The aim of this PhD program is the study of algorithms for learning histograms that can represent continuous high-speed flows of data and address the problem of change detection in data streams. In many modern applications, information is no longer gathered as finite stored data sets but instead takes the form of infinite data streams. Because a large volume of information is produced at a high rate, it is no longer possible to use algorithms that require the full historical data to be stored in main memory, so new ones are needed that process data online at the rate at which it becomes available. Moreover, the process generating the data is not strictly stationary and evolves over time, so algorithms should, while extracting knowledge from this incessantly growing data, be able to adapt to changes, maintaining a representation consistent with the most recent state of nature. In this work, we present a feasible approach, using incremental histograms and monitoring data distributions, to detect concept drift in a data stream context.


2008

Learning Model Trees from Data Streams

Authors
Ikonomovska, E; Gama, J;

Publication
DISCOVERY SCIENCE, PROCEEDINGS

Abstract
In this paper we propose a fast and incremental algorithm for learning model trees from data streams (FIMT) for regression problems. The algorithm is incremental, works online, processes each example once at the speed at which examples arrive, and maintains an any-time regression model. The leaves contain linear models trained online from the examples that fall into that leaf, a process with low complexity. The use of linear models in the leaves increases the any-time global performance. FIMT achieves accuracy competitive with batch learners even for medium-size datasets, with training time an order of magnitude lower. We study the properties of FIMT over several artificial and real datasets and evaluate its sensitivity to the order of examples and to the noise level.


2008

Improving the performance of an incremental algorithm driven by error margins

Authors
del Campo Avila, J; Ramos Jimenez, G; Gama, J; Morales Bueno, R;

Publication
INTELLIGENT DATA ANALYSIS

Abstract
Classification is a highly relevant task within the field of data analysis. It is not a trivial task, and different difficulties can arise depending on the nature of the problem. These difficulties can become worse when the datasets are too large or when new information can arrive at any time. Incremental learning is an approach that can be used to deal with the classification task in these cases. It must alleviate, or solve, the problem of limited time and memory resources. One emerging approach uses concentration bounds to ensure that decisions are made only when enough information supports them. IADEM is one of the most recent algorithms that use this approach. The aim of this paper is to improve the performance of this algorithm in different ways: reducing the complexity of the induced models, adding the ability to deal with continuous data, improving the detection of noise, selecting new criteria for evolving the model, including the use of more powerful prediction techniques, etc. Besides these new properties, the new system, IADEM-2, preserves the ability to obtain a performance similar to that of standard learning algorithms independently of the dataset size, and it can incorporate new information as the basic algorithm does, using a short time per example.
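
To make the role of a concentration bound concrete, here is a minimal sketch using the Hoeffding bound, one widely used bound of this kind. The abstract does not say which bound or decision rule IADEM-2 applies, so the rule and the function names below are assumptions for illustration only.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability at least 1 - delta, the mean of n
    independent observations with the given range lies within this distance
    of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def enough_evidence(best_mean, second_mean, value_range, delta, n):
    """Hypothetical decision rule: commit to the best candidate only when its
    observed advantage exceeds the bound."""
    return (best_mean - second_mean) > hoeffding_bound(value_range, delta, n)


# The same observed gap of 0.05 is not decisive after 100 examples,
# but becomes decisive once 10000 examples have been seen.
print(enough_evidence(0.30, 0.25, 1.0, 0.05, 100))
print(enough_evidence(0.30, 0.25, 1.0, 0.05, 10000))
```

Because the bound shrinks as n grows, a learner built this way can postpone a decision until the stream has supplied enough examples, without storing them.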


2008

Knowledge discovery from data streams

Authors
Gama, J; Aguilar Ruiz, J; Klinkenberg, R;

Publication
INTELLIGENT DATA ANALYSIS

Abstract

2008

Hierarchical clustering of time-series data streams

Authors
Rodrigues, PP; Gama, J; Pedroso, JP;

Publication
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Abstract
This paper presents and analyzes an incremental system for clustering streaming time series. The Online Divisive-Agglomerative Clustering (ODAC) system continuously maintains a tree-like hierarchy of clusters that evolves with data, using a top-down strategy. The splitting criterion is a correlation-based dissimilarity measure among time series, splitting each node by the farthest pair of streams. The system also uses a merge operator that reaggregates a previously split node in order to react to changes in the correlation structure between time series. The split and merge operators are triggered in response to changes in the diameters of existing clusters, assuming that in stationary environments, expanding the structure leads to a decrease in the diameters of the clusters. The system is designed to process thousands of data streams that flow at a high rate. The main features of the system include update time and memory consumption that do not depend on the number of examples in the stream. Moreover, the time and memory required to process an example decrease whenever the cluster structure expands. Experimental results on artificial and real data assess the processing qualities of the system, suggesting competitive performance on clustering streaming time series, and also explore its ability to deal with concept drift.
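
As a rough illustration of a correlation-based dissimilarity maintained in constant memory per pair of streams, the sketch below keeps running sums, derives the Pearson correlation from them, and maps it to a distance with sqrt((1 - corr) / 2). That mapping, the class and the helper names are assumptions for illustration; the abstract does not give the exact formula.

```python
import math
from itertools import combinations


class PairStats:
    """Running sums for one pair of streams; memory use is independent of
    the number of examples seen."""

    def __init__(self):
        self.n = 0
        self.sa = self.sb = self.saa = self.sbb = self.sab = 0.0

    def update(self, a, b):
        self.n += 1
        self.sa += a
        self.sb += b
        self.saa += a * a
        self.sbb += b * b
        self.sab += a * b

    def correlation(self):
        cov = self.sab - self.sa * self.sb / self.n
        var_a = self.saa - self.sa ** 2 / self.n
        var_b = self.sbb - self.sb ** 2 / self.n
        denom = math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0
        return cov / denom if denom > 0 else 0.0

    def dissimilarity(self):
        # Assumed mapping: 0 for perfectly correlated streams, 1 for
        # perfectly anti-correlated ones. Clamped against rounding error.
        return math.sqrt(max(0.0, (1.0 - self.correlation()) / 2.0))


def farthest_pair(stats):
    """Pair of streams with the largest dissimilarity (the cluster diameter)."""
    return max(stats, key=lambda pair: stats[pair].dissimilarity())


names = ["x", "y", "z"]
stats = {pair: PairStats() for pair in combinations(names, 2)}
for t in range(1, 6):
    example = {"x": float(t), "y": 2.0 * t + 1.0, "z": -float(t)}
    for (a, b), s in stats.items():
        s.update(example[a], example[b])

print(farthest_pair(stats))
```

In this toy stream, x and y move together while z moves against both, so the farthest pair always involves z; a divisive step would use such a pair to seed the two child clusters.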
