Publications - INESC TEC

Publications by João Gama

2007

Data stream processing

Authors
Gama, J; Rodrigues, PP;

Publication
Learning from Data Streams: Processing Techniques in Sensor Networks

Abstract
The rapid growth of information science and technology in general, and the complexity and volume of data in particular, have introduced new challenges for the research community. Many sources produce data continuously. Examples include sensor networks, wireless networks, radio frequency identification (RFID), customer click streams, telephone records, multimedia data, scientific data, and sets of retail chain transactions. These sources are called data streams. A data stream is an ordered sequence of instances that can be read only once, or a small number of times, using limited computing and storage capabilities. These sources of data are characterized by being open-ended, flowing at high speed, and generated by non-stationary distributions in dynamic environments. What distinguishes current data from earlier data is automatic data feeds: we no longer just have people entering information into a computer; instead, we have computers entering data into each other [25].

Nowadays there are applications in which the data are best modeled as transient data streams rather than as persistent tables. Examples include network monitoring, user modeling in web applications, sensor networks in electrical networks, telecommunications data management, prediction in stock markets, and monitoring radio frequency identification. In these applications it is not feasible to load the arriving data into a traditional database management system (DBMS), and traditional DBMSs are not designed to directly support the continuous queries these applications require [3]. Carney et al. [6] pointed out the significant differences between databases that are passive repositories of data and databases that actually monitor applications and alert humans when abnormal activity is detected. In the former, only the current state of the data is relevant for analysis, and humans initiate queries, usually one-time, predefined queries. In the latter, data come from external sources (e.g., sensors) and require processing historic data. For example, in monitoring activity, queries should run continuously. The answer to a continuous query is produced over time, reflecting the data seen so far. Moreover, if the process is not strictly stationary (as in most real-world applications), the target concept may gradually change over time. For example, the type of abnormal activity (e.g., attacks in TCP/IP networks, fraud in credit card transactions) changes over time.

Organizations use decision support systems to identify potentially useful patterns in data. Data analysis is complex, interactive, and exploratory over very large volumes of historic data, possibly stored in distributed environments. The traditional pattern discovery process requires online ad hoc queries, not previously defined, that are successively refined. Nowadays, given the current trends in decision support and data analysis, the computer plays a much more active role by searching for hypotheses and by evaluating and suggesting patterns. Due to the exploratory nature of these queries, an exact answer may not be required: a user may prefer a fast approximate answer. Range queries and selectivity estimation (the proportion of tuples that satisfy a query) are two illustrative examples where fast but approximate answers are more useful than slow and exact ones.

Sensor networks are distributed environments producing multiple streams of data. We can consider the network as a distributed database that we are interested in querying and mining. In this chapter we review the main techniques used for querying and mining data streams that are of potential use in sensor networks. In Sect. 3.2 we refer to the data stream models and identify their main research challenges. Section 3.3 presents basic stream models. Section 3.4 presents basic stream algorithms for maintaining synopses over data streams. Section 3.5 concludes the chapter and points out future directions for research. © 2007 Springer-Verlag Berlin Heidelberg.
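
The selectivity-estimation case mentioned in this abstract can be illustrated with a small synopsis. The sketch below is not taken from the chapter: it assumes reservoir sampling (Algorithm R) as the synopsis, and the class name `ReservoirSample` and its methods are hypothetical. It keeps a fixed-size uniform sample of a stream read once and answers "what proportion of tuples satisfy this predicate?" approximately from that sample.

```python
import random


class ReservoirSample:
    """Fixed-size uniform sample of a stream (Algorithm R).

    Hypothetical helper for illustration; not the chapter's own code.
    """

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []          # the synopsis: at most `capacity` items
        self.seen = 0            # number of stream items read so far
        self._rng = random.Random(seed)

    def add(self, x):
        """Process one stream item in O(1) time and bounded memory."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(x)
        else:
            # Keep the new item with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = x

    def selectivity(self, predicate):
        """Approximate proportion of stream items satisfying `predicate`."""
        if not self.items:
            return 0.0
        hits = sum(1 for x in self.items if predicate(x))
        return hits / len(self.items)
```

The estimate is exact while the stream is shorter than the sample capacity and becomes an approximation afterwards, which is the trade-off the abstract describes: a fast answer from bounded memory instead of a slow exact one.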

2011

MEC - Monitoring Clusters' Transitions

Authors
Oliveira, M; Gama, J;

Publication
STAIRS 2010: PROCEEDINGS OF THE FIFTH STARTING AI RESEARCHERS' SYMPOSIUM

Abstract
In this work we address the problem of monitoring the evolution of clusters, which has become an important research issue in recent years due to our ability to collect and store data that evolves over time. The evolution is traced through the detection and categorization of transitions undergone by clusters' structures computed at different points in time. We adopt two main strategies for cluster characterization (representation by enumeration and representation by comprehension) and propose the MEC (Monitor of the Evolution of Clusters) framework, which was developed along the lines of the change mining paradigm. MEC includes a taxonomy of various types of clusters' transitions, a tracking mechanism that depends on cluster representation, and a transition detection algorithm. Our tracking mechanism can be subdivided into two methods, devised to monitor clusters' transitions: one based on graph transitions, and another based on clusters' overlap. To demonstrate the feasibility and applicability of MEC we present real-world case studies, using datasets from different knowledge areas, such as Economy and Education.
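
As a rough illustration of overlap-based tracking for clusters represented by enumeration, the sketch below compares two clusterings taken at consecutive time points. It is a simplified illustration under stated assumptions, not the MEC algorithm: the function name `detect_transitions`, the transition labels, and the `survival` and `split` thresholds are all hypothetical.

```python
def detect_transitions(old, new, survival=0.5, split=0.2):
    """Label what happened to each old cluster, based on member overlap.

    `old` and `new` map cluster names to sets of object ids.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    transitions = {}
    for name, members in old.items():
        # Share of the old cluster's members found in each new cluster.
        shares = {
            n: len(members & m) / len(members)
            for n, m in new.items()
            if members
        }
        survivors = sorted(n for n, s in shares.items() if s >= survival)
        fragments = sorted(n for n, s in shares.items() if s >= split)
        if survivors:
            transitions[name] = ("survives", survivors)
        elif len(fragments) >= 2:
            transitions[name] = ("splits", fragments)
        else:
            transitions[name] = ("disappears", [])
    # New clusters that no old cluster maps to are treated as births.
    matched = {n for _, targets in transitions.values() for n in targets}
    births = sorted(set(new) - matched)
    return transitions, births
```

A fuller treatment would also cover merges (several old clusters mapping to one new cluster) and internal changes such as growth or shrinkage, which this sketch leaves out.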

1999

Linear tree

Authors
Gama, J; Brazdil, P;

Publication
Intell. Data Anal.

Abstract

2008

Learning from Data Streams: Synopsis and Change Detection

Authors
Sebastiao, R; Gama, J; Mendonca, T;

Publication
STAIRS 2008

Abstract
The aim of this PhD program is the study of algorithms for learning histograms that are capable of representing continuous high-speed flows of data and of dealing with the current problem of change detection on data streams. In many modern applications, information is no longer gathered as finite stored data sets, but takes the form of infinite data streams. As a large volume of information is produced at a high rate, it is no longer possible to use algorithms that require the full historic data to be stored in main memory, so new ones are needed to process data online at the rate it becomes available. Moreover, the process generating data is not strictly stationary and evolves over time, so algorithms should, while extracting some sort of knowledge from this incessantly growing data, be able to adapt themselves to changes, maintaining a representation consistent with the most recent status of nature. In this work, we present a feasible approach, using incremental histograms and monitoring data distributions, to detect concept drift in the data stream context.
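
The idea of monitoring distributions with incremental histograms can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: it assumes equal-width bins over a known value range, non-overlapping windows, and total variation distance as the comparison measure, and the names `IncrementalHistogram`, `total_variation`, and `detect_change` are hypothetical.

```python
class IncrementalHistogram:
    """Equal-width histogram updated one observation at a time."""

    def __init__(self, low, high, bins):
        self.low, self.high, self.bins = low, high, bins
        self.counts = [0] * bins
        self.total = 0

    def update(self, x):
        pos = int((x - self.low) / (self.high - self.low) * self.bins)
        pos = min(max(pos, 0), self.bins - 1)   # clamp out-of-range values
        self.counts[pos] += 1
        self.total += 1

    def frequencies(self):
        if self.total == 0:
            return [0.0] * self.bins
        return [c / self.total for c in self.counts]


def total_variation(p, q):
    """Distance in [0, 1] between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def detect_change(stream, window, low, high, bins=10, threshold=0.3):
    """Return the index at which a change is flagged, or None.

    The first `window` items form the reference histogram; each later
    window of the same size is compared against it.
    """
    reference = IncrementalHistogram(low, high, bins)
    current = IncrementalHistogram(low, high, bins)
    for i, x in enumerate(stream):
        if reference.total < window:
            reference.update(x)
            continue
        current.update(x)
        if current.total == window:
            distance = total_variation(reference.frequencies(),
                                       current.frequencies())
            if distance > threshold:
                return i
            current = IncrementalHistogram(low, high, bins)
    return None
```

Only two small histograms are held in memory at any time, so the stream itself never needs to be stored; the window size and threshold control how quickly and how readily a change is reported.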

2008

Learning Model Trees from Data Streams

Authors
Ikonomovska, E; Gama, J;

Publication
DISCOVERY SCIENCE, PROCEEDINGS

Abstract
In this paper we propose a fast and incremental algorithm for learning model trees from data streams (FIMT) for regression problems. The algorithm is incremental, works online, processes each example once at the speed it arrives, and maintains an any-time regression model. The leaves contain linear models trained online from the examples that fall into that leaf, a process with low complexity. The use of linear models in the leaves increases the any-time global performance of the tree. FIMT is able to obtain accuracy competitive with batch learners even for medium-size datasets, with training time better by an order of magnitude. We study the properties of FIMT over several artificial and real datasets and evaluate its sensitivity to the order of examples and to the noise level.
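
To make the idea of a leaf model trained online concrete, the sketch below updates a linear model one example at a time. It assumes a plain stochastic gradient (delta rule) update with a fixed learning rate, covers only a single leaf rather than the tree-growing procedure, and the class name `OnlineLinearModel` and its methods are hypothetical rather than FIMT's actual interface.

```python
class OnlineLinearModel:
    """Linear regression model updated once per incoming example."""

    def __init__(self, n_features, learning_rate=0.01):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * v for w, v in zip(self.weights, x))

    def learn_one(self, x, y):
        """Delta-rule update: move the weights against the prediction error."""
        error = y - self.predict(x)
        self.bias += self.learning_rate * error
        self.weights = [
            w + self.learning_rate * error * v
            for w, v in zip(self.weights, x)
        ]
```

Each update costs time proportional to the number of features and needs no stored examples, which is consistent with the low per-example complexity the abstract attributes to the leaf models; the model can be queried with `predict` at any point during training.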

2010

Knowledge Discovery from Sensor Data, Second International Workshop, Sensor-KDD 2008, Las Vegas, NV, USA, August 24-27, 2008, Revised Selected Papers

Authors
Gaber, MM; Vatsavai, RR; Omitaomu, OA; Gama, J; Chawla, NV; Ganguly, AR;

Publication
KDD Workshop on Knowledge Discovery from Sensor Data

Abstract