
Publications by LIAAD

2008

Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks

Authors
Spinosa, EJ; de Carvalho, APDF; Gama, J;

Publication
APPLIED COMPUTING 2008, VOLS 1-3

Abstract
In this paper, a cluster-based novelty detection technique capable of dealing with a large amount of data is presented and evaluated in the context of intrusion detection. Starting with examples of a single class that describe the normal profile, the proposed technique detects novel concepts initially as cohesive clusters of examples and later as sets of clusters in an unsupervised incremental learning fashion. Experimental results with the KDD Cup 1999 data set show that the technique is capable of dealing with data streams, successfully learning novel concepts that are pure in terms of the real class structure.
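The core mechanism described above (flagging examples that fall outside every cohesive cluster of the normal profile) can be sketched as follows. This is a hypothetical, simplified illustration, not the authors' incremental algorithm: the cluster data, the distance threshold, and the function names are all assumed for the example.

```python
import math

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def detect_novelty(normal_clusters, point, threshold):
    """Flag a point as novel if it lies farther than `threshold`
    from the centroid of every cluster of the normal profile."""
    centers = [centroid(c) for c in normal_clusters]
    return all(dist(point, c) > threshold for c in centers)

# Normal profile: two cohesive clusters of 2-D examples (invented data).
normal = [
    [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2)],
    [(5.0, 5.0), (5.1, 4.9), (4.8, 5.2)],
]

print(detect_novelty(normal, (0.1, 0.1), threshold=1.0))    # near a normal cluster -> False
print(detect_novelty(normal, (10.0, -3.0), threshold=1.0))  # far from both -> True (novel)
```

In the actual technique, points flagged this way would accumulate into new cohesive clusters, which are then promoted to novel concepts; the sketch only shows the distance-based membership test.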

2008

Robust Division in Clustering of Streaming Time Series

Authors
Rodrigues, PP; Gama, J;

Publication
ECAI 2008, PROCEEDINGS

Abstract
Online learning algorithms that address fast data streams should process examples at the rate they arrive, using a single scan of the data and fixed memory, maintaining a decision model at all times and being able to adapt the model to the most recent data. These requirements imply the use of approximate models. One problem that usually arises with approximate models is the definition of a minimum number of observations necessary to assure convergence, which implies a high risk, since the system may have to decide based only on a small subset of the entire data. One approach is to apply techniques based on the Hoeffding bound to enforce decisions with a given confidence level. In divisive clustering of time series, the goal is to find clusters of similar time series over time. In online approaches there are two decisions to make: when to split and how to assign variables to the new clusters. A confidence level can be defined both for the decision to split and for the assignment of data variables to new clusters. Previous work has already addressed confident decisions on the moment of the split. Our proposal is to include a confidence level in the assignment process. When a split point is reported, creating two new clusters, we can directly assign the variables that are confidently closer to one cluster than to the other, applying a different strategy to those variables that do not satisfy the confidence level. In this paper we propose to assign the unsure variables to a third cluster. Experimental evaluation is presented in the context of a recently proposed hierarchical algorithm, assessing the advantages of the proposal and also revealing gains in memory usage and processing speed. Although the proposal is evaluated under the scope of an existing method, it can be generalized to any divisive procedure.
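The Hoeffding-bound test behind such confident decisions can be made concrete. The sketch below is a minimal illustration, assuming a bounded quantity with range R, a confidence parameter δ, and n observations; the three-way `assign` routine mirrors the proposed idea of deferring unsure variables to a third cluster, though the actual method operates on streaming time-series dissimilarities.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean
    lies within epsilon of the sample mean after n observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def assign(dist_a, dist_b, value_range, delta, n):
    """Assign a variable to cluster A or B only when the observed
    distance gap exceeds the Hoeffding bound; otherwise defer it
    to a third, 'unsure' cluster."""
    eps = hoeffding_bound(value_range, delta, n)
    if dist_b - dist_a > eps:
        return "A"
    if dist_a - dist_b > eps:
        return "B"
    return "unsure"

eps = hoeffding_bound(value_range=1.0, delta=0.05, n=200)
print(round(eps, 4))                       # prints 0.0865; shrinks as n grows
print(assign(0.10, 0.60, 1.0, 0.05, 200))  # clear gap -> "A"
print(assign(0.48, 0.52, 1.0, 0.05, 200))  # gap within eps -> "unsure"
```

Because epsilon decreases with n, a variable that is "unsure" early in the stream can be confidently reassigned once more observations arrive.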

2008

Clustering Distributed Sensor Data Streams

Authors
Rodrigues, PP; Gama, J; Lopes, L;

Publication
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PART II, PROCEEDINGS

Abstract
Nowadays, applications produce infinite streams of data distributed across wide sensor networks. In this work we study the problem of continuously maintaining a cluster structure over the data points generated by the entire network. Usual techniques operate by forwarding and concentrating the entire data in a central server and processing it as a multivariate stream. In this paper, we propose DGClust, a new distributed algorithm which reduces both the dimensionality and the communication burden by allowing each local sensor to keep an online discretization of its data stream, which operates with constant update time and (almost) fixed space. Each new data point triggers a cell in this univariate grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, it notifies the central server of the new state it is in. This way, at each point in time, the central site has the global multivariate state of the entire network. To avoid monitoring all possible states, whose number is exponential in the number of sensors, the central site keeps a small list of counters for the most frequent global states. Finally, a simple adaptive partitional clustering algorithm is applied to the central points of the frequent states in order to provide an anytime definition of the cluster centers. The approach is evaluated in the context of distributed sensor networks, presenting both empirical and theoretical evidence of its advantages.
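The local discretization and change-only communication described above can be sketched as follows. This is a hypothetical toy version, not the DGClust implementation: the grid width, the `LocalSensor` class, and the readings are all invented, and the final adaptive clustering step is omitted.

```python
from collections import Counter

def cell(value, width=1.0):
    """Univariate grid cell index for a sensor reading."""
    return int(value // width)

class LocalSensor:
    """Keeps the current grid cell and reports only state changes,
    reducing communication with the central site."""
    def __init__(self):
        self.state = None

    def update(self, value):
        new = cell(value)
        if new != self.state:
            self.state = new
            return new   # would be sent to the central server
        return None      # no message needed

sensors = [LocalSensor(), LocalSensor()]
global_states = Counter()  # central site: counters of frequent global states
stream = [(0.2, 3.1), (0.4, 3.2), (1.7, 3.3), (1.9, 5.0)]

for readings in stream:
    for s, v in zip(sensors, readings):
        s.update(v)
    # Central site always knows the current global multivariate state.
    global_states[tuple(s.state for s in sensors)] += 1

print(global_states.most_common(1))  # [((0, 3), 2)]
```

Note the second reading produces no messages at all: both sensors stay in their cells, yet the central state remains correct.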

2008

Improving the performance of an incremental algorithm driven by error margins

Authors
del Campo Avila, J; Ramos Jimenez, G; Gama, J; Morales Bueno, R;

Publication
Intelligent Data Analysis

Abstract
Classification is a quite relevant task within the data analysis field. It is not a trivial task, and different difficulties can arise depending on the nature of the problem. All these difficulties can become worse when the datasets are too large or when new information can arrive at any time. Incremental learning is an approach that can be used to deal with the classification task in these cases. It must alleviate, or solve, the problem of limited time and memory resources. One emergent approach uses concentration bounds to ensure that decisions are made when enough information supports them. IADEM is one of the most recent algorithms that use this approach. The aim of this paper is to improve the performance of this algorithm in different ways: simplifying the complexity of the induced models, adding the ability to deal with continuous data, improving the detection of noise, selecting new criteria for evolving the model, including the use of more powerful prediction techniques, etc. Besides these new properties, the new system, IADEM-2, preserves the ability to achieve performance similar to standard learning algorithms independently of the dataset size, and it can incorporate new information as the basic algorithm does: using a short time per example.

2008

Schema matching on streams with accuracy guarantees

Authors
Gama, J; Aguilar Ruiz, J; Klinkenberg, R;

Publication
Intelligent Data Analysis

Abstract
We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database records - which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, and is yet guaranteed to find a matching that is a close approximation of the matching that would be obtained if the entire stream were processed. The method can be applied to any given (combination of) similarity metrics that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.
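The idea of estimating instance-level similarity from a small sample rather than the full stream can be sketched as follows. This is a hypothetical illustration only: it uses a simple value-overlap metric and greedy matching, without the bounded-error accuracy guarantees the paper derives, and all column names and data are invented.

```python
import random

def value_overlap(sample_a, sample_b):
    """Jaccard-style overlap of value sets: a simple instance-level
    similarity between two attributes."""
    a, b = set(sample_a), set(sample_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_attributes(cols_a, cols_b, sample_size, seed=0):
    """Greedily match each attribute of schema A to the attribute of
    schema B with the highest similarity, estimated from samples
    instead of the full columns."""
    rng = random.Random(seed)
    best = {}
    for name_a, vals_a in cols_a.items():
        sa = rng.sample(vals_a, min(sample_size, len(vals_a)))
        scored = []
        for name_b, vals_b in cols_b.items():
            sb = rng.sample(vals_b, min(sample_size, len(vals_b)))
            scored.append((value_overlap(sa, sb), name_b))
        best[name_a] = max(scored)[1]
    return best

# Invented toy schemas with imperfectly documented column names.
db_a = {"country": ["PT", "ES", "FR", "DE"] * 50, "year": list(range(200))}
db_b = {"nation": ["PT", "ES", "IT", "DE"] * 50, "yr": list(range(100, 300))}
print(match_attributes(db_a, db_b, sample_size=30))
```

Only 30 of each column's values are touched, yet the value overlap is enough to pair `country` with `nation`; the paper's contribution is proving how large such a sample must be for the sampled matching to approximate the full-stream matching.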

2008

The dimension of ECOCs for multiclass classification problems

Authors
Pimenta, E; Gama, J; Carvalho, A;

Publication
INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS

Abstract
Several classification problems involve more than two classes. These problems are known as multiclass classification problems. One of the approaches to deal with multiclass problems is their decomposition into a set of binary problems. Recent work shows important advantages associated with this approach. Several strategies have been proposed for this decomposition. The strategies most frequently used are All-vs-All, One-vs-All and Error-Correcting Output Codes (ECOC). ECOCs are based on binary words (codewords) and have been adapted to deal with multiclass problems. To do so, they must comply with a number of specific constraints. Different dimensions may be adopted for the codewords for each number of classes in the problem. These dimensions grow exponentially with the number of classes present in a dataset. Two methods to choose the dimension of an ECOC, which assure a good trade-off between redundancy and error-correction capacity, are proposed in this paper. The proposed methods are evaluated on a set of benchmark classification problems. Experimental results show that they are competitive with other multiclass decomposition methods.
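A small worked example of ECOC decoding may help make the redundancy/error-correction trade-off concrete. The 6-bit codeword matrix below is hypothetical (not taken from the paper): each row is a class codeword, each column corresponds to one binary classifier, and prediction picks the codeword nearest in Hamming distance to the classifiers' outputs.

```python
def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 6-bit ECOC matrix for a 4-class problem. Its minimum
# pairwise Hamming distance is 3, so any single classifier error
# can be corrected.
codewords = {
    "c0": [0, 0, 0, 0, 0, 0],
    "c1": [0, 1, 1, 1, 1, 0],
    "c2": [1, 0, 1, 1, 0, 1],
    "c3": [1, 1, 0, 1, 1, 1],
}

def decode(bits):
    """Predict the class whose codeword is nearest (Hamming) to the
    binary classifiers' outputs."""
    return min(codewords, key=lambda c: hamming(codewords[c], bits))

print(decode([0, 1, 1, 1, 1, 0]))  # exact match -> c1
print(decode([0, 1, 1, 0, 1, 0]))  # one bit flipped, still c1
```

Longer codewords increase the minimum distance and thus the number of classifier errors that can be absorbed, at the cost of training more binary classifiers; choosing that dimension well is exactly the trade-off the paper studies.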
