Publications

Publications by LIAAD

2008

Knowledge discovery from sensor data

Authors
Ganguly, AR; Gama, J; Omitaomu, OA; Gaber, MM; Vatsavai, RR;

Publication
Knowledge Discovery from Sensor Data

Abstract
As sensors become ubiquitous, a set of broad requirements is beginning to emerge across high-priority applications including disaster preparedness and management, adaptability to climate change, national or homeland security, and the management of critical infrastructures. This book presents innovative solutions in offline data mining and real-time analysis of sensor or geographically distributed data. It discusses the challenges and requirements for sensor-data-based knowledge discovery solutions in high-priority applications, illustrated with case studies. It explores the fusion between heterogeneous data streams from multiple sensor types and applications in science, engineering, and security. © 2009 by Taylor & Francis Group, LLC.

2008

Introduction

Authors
Ganguly, AR; Gama, J; Omitaomu, OA; Gaber, MM; Vatsavai, RR;

Publication
Knowledge Discovery from Sensor Data

Abstract

2008

Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks

Authors
Spinosa, EJ; de Carvalho, APDF; Gama, J;

Publication
APPLIED COMPUTING 2008, VOLS 1-3

Abstract
In this paper, a cluster-based novelty detection technique capable of dealing with a large amount of data is presented and evaluated in the context of intrusion detection. Starting with examples of a single class that describe the normal profile, the proposed technique detects novel concepts initially as cohesive clusters of examples and later as sets of clusters in an unsupervised incremental learning fashion. Experimental results with the KDD Cup 1999 data set show that the technique is capable of dealing with data streams, successfully learning novel concepts that are pure in terms of the real class structure.

2008

Robust Division in Clustering of Streaming Time Series

Authors
Rodrigues, PP; Gama, J;

Publication
ECAI 2008, PROCEEDINGS

Abstract
Online learning algorithms that address fast data streams should process examples at the rate they arrive, using a single scan of data and fixed memory, maintaining a decision model at any time and being able to adapt the model to the most recent data. These features make it necessary to use approximate models. One problem that usually arises with approximate models is the definition of a minimum number of observations necessary to assure convergence, which implies a high risk since the system may have to decide based only on a small subset of the entire data. One approach is to apply techniques based on the Hoeffding bound to enforce decisions with a confidence level. In divisive clustering of time series, the goal is to find clusters of similar time series over time. In online approaches there are two decisions to make: when to split and how to assign variables to new clusters. We can define a confidence level for both the decision of splitting and the assignment of data variables to new clusters. Previous work has already addressed confident decisions on the moment of split. Our proposal is to include a confidence level in the assignment process. When a split point is reported, creating two new clusters, we can directly assign points which are confidently closer to one cluster than the other, having a different strategy for those variables which do not satisfy the confidence level. In this paper we propose to assign the unsure variables to a third cluster. Experimental evaluation is presented in the context of a recently proposed hierarchical algorithm, assessing the advantages of the proposal and also revealing advantages in memory usage reduction and processing speed. Although this proposal is evaluated under the scope of an existing method, it can be generalized to any divisive procedure.
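
Illustrative note (not part of the published abstract): the confident-assignment idea above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names `hoeffding_bound` and `assign`, the use of two precomputed distances per variable, and the labels "A", "B" and "unsure" are all hypothetical choices made for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding epsilon: with probability at least 1 - delta, the mean of
    n observations of a variable with the given range lies within epsilon
    of its true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def assign(dist_a: float, dist_b: float, epsilon: float) -> str:
    """Assign a variable after a split (hypothetical helper).

    dist_a and dist_b are the observed distances to the two new clusters.
    The variable goes to a cluster only if it is closer to it by more than
    epsilon; otherwise it is placed in a third, "unsure" cluster.
    """
    if dist_b - dist_a > epsilon:
        return "A"
    if dist_a - dist_b > epsilon:
        return "B"
    return "unsure"


# Example: distances normalized to [0, 1], 95% confidence, 1000 observations.
eps = hoeffding_bound(value_range=1.0, delta=0.05, n=1000)
print(assign(0.2, 0.8, eps), assign(0.50, 0.52, eps))
```

Because epsilon shrinks as more observations arrive, fewer variables would fall into the "unsure" cluster over time under this sketch.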

2008

Clustering Distributed Sensor Data Streams

Authors
Rodrigues, PP; Gama, J; Lopes, L;

Publication
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PART II, PROCEEDINGS

Abstract
Nowadays, applications produce infinite streams of data distributed across wide sensor networks. In this work we study the problem of continuously maintaining a cluster structure over the data points generated by the entire network. Usual techniques operate by forwarding and concentrating the entire data in a central server, processing it as a multivariate stream. In this paper, we propose DGClust, a new distributed algorithm which reduces both the dimensionality and the communication burdens, by allowing each local sensor to keep an online discretization of its data stream, which operates with constant update time and (almost) fixed space. Each new data point triggers a cell in this univariate grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, it notifies the central server about the new state it is in. This way, at each point in time, the central site has the global multivariate state of the entire network. To avoid monitoring all possible states, whose number is exponential in the number of sensors, the central site keeps a small list of counters of the most frequent global states. Finally, a simple adaptive partitional clustering algorithm is applied to the central points of the frequent states in order to provide an anytime definition of the cluster centers. The approach is evaluated in the context of distributed sensor networks, presenting both empirical and theoretical evidence of its advantages.

2008

Improving the performance of an incremental algorithm driven by error margins

Authors
del Campo Avila, J; Ramos Jimenez, G; Gama, J; Morales Bueno, R;

Publication
Intelligent Data Analysis

Abstract
Classification is a highly relevant task within the field of data analysis. It is not a trivial task, and different difficulties can arise depending on the nature of the problem. All these difficulties can become worse when the datasets are too large or when new information can arrive at any time. Incremental learning is an approach that can be used to deal with the classification task in these cases. It must alleviate, or solve, the problem of limited time and memory resources. One emergent approach uses concentration bounds to ensure that decisions are made when enough information supports them. IADEM is one of the most recent algorithms that use this approach. The aim of this paper is to improve the performance of this algorithm in different ways: simplifying the induced models, adding the ability to deal with continuous data, improving the detection of noise, selecting new criteria for evolving the model, including the use of more powerful prediction techniques, etc. Besides these new properties, the new system, IADEM-2, preserves the ability to obtain a performance similar to standard learning algorithms independently of the dataset size, and it can incorporate new information as the basic algorithm does: using a short time per example.
