Publications - INESC TEC

Publications

Publications by João Gama

2012

A framework to monitor clusters evolution applied to economy and finance problems

Authors
Oliveira, M; Gama, J;

Publication
INTELLIGENT DATA ANALYSIS

Abstract
The study of evolution has become an important research issue, especially in the last decade, due to our ability to collect and store highly detailed, time-stamped data. The need to describe and understand the behavior of a given phenomenon over time led to the emergence of new frameworks and methods focused on the temporal evolution of data and models. In this paper we address the problem of monitoring the evolution of clusters over time and propose the MEC framework. MEC traces evolution through the detection and categorization of cluster transitions, such as births, deaths and merges, and enables their visualization through bipartite graphs. It includes a taxonomy of transitions, a tracking method based on the computation of conditional probabilities, and a transition detection algorithm. We use MEC with two main goals: to determine the general evolution trends and to detect abnormal behavior or rare events. To demonstrate the applicability of our framework we present real-world economic and financial case studies, using datasets extracted from the Banco de Portugal Central Balance-Sheet Database and The Data Page of the New York University Leonard N. Stern School of Business. The results allow us to draw interesting conclusions about the evolution of activity sectors and European companies.
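
As a rough illustration of tracking based on conditional probabilities, the sketch below compares two clusterings of the same objects at consecutive time points and labels births, deaths, survivals and merges. It is not the MEC implementation: the function names, the threshold `tau` and its default value are assumptions made for this example, and split detection is left out.

```python
from collections import defaultdict


def conditional_probabilities(old, new):
    """Estimate P(new cluster | old cluster) from shared members.

    `old` and `new` map a cluster id to the set of objects it contains.
    """
    return {
        (o, n): len(old_members & new_members) / len(old_members)
        for o, old_members in old.items() if old_members
        for n, new_members in new.items()
    }


def detect_transitions(old, new, tau=0.5):
    """Label transitions between two clusterings (tau is a hypothetical threshold)."""
    probs = conditional_probabilities(old, new)

    # Edges of the bipartite graph: old -> new links at or above the threshold.
    matches = defaultdict(list)
    for (o, n), p in probs.items():
        if p >= tau:
            matches[n].append(o)
    matched_old = {o for olds in matches.values() for o in olds}

    transitions = []
    for o in old:
        if o not in matched_old:
            transitions.append(("death", o))
    for n in new:
        olds = matches.get(n, [])
        if not olds:
            transitions.append(("birth", n))
        elif len(olds) == 1:
            transitions.append(("survival", olds[0], n))
        else:
            transitions.append(("merge", tuple(sorted(olds)), n))
    return transitions


old_clusters = {"A": {1, 2, 3, 4}, "B": {5, 6, 7}, "C": {8, 9}}
new_clusters = {"X": {1, 2, 3, 4, 5, 6, 7}, "Y": {10, 11}}
for transition in detect_transitions(old_clusters, new_clusters):
    print(transition)
```

In this toy input, clusters A and B both map entirely into X, C has no successor, and Y has no predecessor, so the sketch reports one merge, one death and one birth.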


2009

Novelty detection with application to data streams

Authors
Spinosa, EJ; de Carvalho, APDF; Gama, J;

Publication
INTELLIGENT DATA ANALYSIS

Abstract
This paper presents and evaluates an approach to novelty detection that addresses it as the problem of identifying novel concepts in a continuous learning scenario, as an extension to a single-class classification problem. OLINDDA, an OnLIne Novelty and Drift Detection Algorithm that implements this approach, uses efficient standard clustering algorithms to continuously generate candidate clusters among examples that were not explained by the current known concepts. Clusters complying with a validation criterion that takes cohesiveness and representativeness into account are initially identified as concepts. By merging similar concepts, OLINDDA may enhance the representation of some concepts as it advances toward its final goal of describing novel emerging concepts in an unsupervised way. The proposed approach is experimentally evaluated by the use of several measures taken throughout the learning process. Results show that it is capable of identifying novel concepts that are pure and correspond to real classes, disregarding unrepresentative clusters and outliers.


2011

Clustering distributed sensor data streams using local processing and reduced communication

Authors
Gama, J; Rodrigues, PP; Lopes, L;

Publication
INTELLIGENT DATA ANALYSIS

Abstract
Nowadays, applications produce infinite streams of data distributed across wide sensor networks. In this work we study the problem of continuously maintaining a cluster structure over the data points generated by the entire network. Usual techniques operate by forwarding and concentrating the entire data in a central server, processing it as a multivariate stream. In this paper, we propose DGClust, a new distributed algorithm which reduces both the dimensionality and the communication burdens by allowing each local sensor to keep an online discretization of its data stream, which operates with constant update time and (almost) fixed space. Each new data point triggers a cell in this univariate grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, it notifies the central server about the new state it is in. This way, at each point in time, the central site has the global multivariate state of the entire network. To avoid monitoring all possible states, whose number is exponential in the number of sensors, the central site keeps a small list of counters of the most frequent global states. Finally, a simple adaptive partitional clustering algorithm is applied to the central points of the frequent states in order to provide an anytime definition of the cluster centers. The approach is evaluated in the context of distributed sensor networks, focusing on three outcomes: loss to real centroids, communication prevention, and processing reduction. The experimental work on synthetic data supports our proposal, presenting robustness to a high number of sensors, and the application to real data from physiological sensors exposes the aforementioned advantages of the system.
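
The communication pattern described in this abstract can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the DGClust implementation: the class names `LocalSite` and `CentralSite` are hypothetical, each site uses a fixed equal-width grid over a known range rather than the online discretization of the paper, the central site uses an unbounded `Counter` instead of a small list of frequent-state counters, and the final clustering step is omitted.

```python
from collections import Counter


class LocalSite:
    """One sensor that maps each reading to a cell of a univariate grid."""

    def __init__(self, low, high, n_cells):
        self.low, self.high, self.n_cells = low, high, n_cells
        self.state = None  # index of the current cell

    def update(self, value):
        """Return the new cell index if the state changed, else None."""
        width = (self.high - self.low) / self.n_cells
        cell = int((value - self.low) / width)
        cell = min(max(cell, 0), self.n_cells - 1)  # clamp out-of-range readings
        if cell != self.state:
            self.state = cell
            return cell
        return None


class CentralSite:
    """Keeps the global multivariate state and counts how often each occurs."""

    def __init__(self, n_sites):
        self.global_state = [None] * n_sites
        self.state_counts = Counter()
        self.messages = 0

    def notify(self, site_id, cell):
        self.global_state[site_id] = cell
        self.messages += 1

    def tick(self):
        # Count the current global state once every site has reported.
        if None not in self.global_state:
            self.state_counts[tuple(self.global_state)] += 1


def run(readings, sites, central):
    for row in readings:
        for site_id, value in enumerate(row):
            cell = sites[site_id].update(value)
            if cell is not None:  # only state changes are transmitted
                central.notify(site_id, cell)
        central.tick()


sites = [LocalSite(0.0, 10.0, 5), LocalSite(0.0, 10.0, 5)]
central = CentralSite(n_sites=2)
run([(1.0, 9.0), (1.5, 9.5), (5.0, 9.0)], sites, central)
print(central.messages, central.state_counts.most_common(1))
```

Here six raw readings lead to three messages, because a site stays silent while its readings fall in the same grid cell.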


2009

A system for analysis and prediction of electricity-load streams

Authors
Rodrigues, PP; Gama, J;

Publication
INTELLIGENT DATA ANALYSIS

Abstract
Sensors distributed all around electrical-power distribution networks produce streams of data at high speed. From a data mining perspective, this sensor network problem is characterized by a large number of variables (sensors), producing a continuous flow of data, in a dynamic non-stationary environment. Companies make decisions to buy or sell energy based on load profiles and forecasts. In this work we analyze the most relevant data mining problems and issues: continuously learning clusters and predictive models, model adaptation in large domains, and change detection and adaptation. The goal is to continuously maintain a clustering model, defining profiles, and a predictive model able to incorporate new information at the speed data arrives, detecting changes and adapting the decision models to the most recent information. We present experimental results in a large real-world scenario, illustrating the advantages of continuous learning and its competitiveness against wavelet-based prediction. We also propose a light electrical load visualization system which enhances the ability to inspect forecast results on mobile devices.


2008

Robust Division in Clustering of Streaming Time Series

Authors
Rodrigues, PP; Gama, J;

Publication
ECAI 2008, PROCEEDINGS

Abstract
Online learning algorithms which address fast data streams should process examples at the rate they arrive, using a single scan of data and fixed memory, maintaining a decision model at any time and being able to adapt the model to the most recent data. These features yield the necessity of using approximate models. One problem that usually arises with approximate models is the definition of a minimum number of observations necessary to assure convergence, which implies a high risk since the system may have to decide based only on a small subset of the entire data. One approach is to apply techniques based on the Hoeffding bound to enforce decisions with a confidence level. In divisive clustering of time series, the goal is to find clusters of similar time series over time. In online approaches there are two decisions to make: when to split and how to assign variables to new clusters. We can define a confidence level for both the decision to split and the assignment of data variables to new clusters. Previous works have already addressed confident decisions on the moment of split. Our proposal is to include a confidence level in the assignment process. When a split point is reported, creating two new clusters, we can directly assign the variables which are confidently closer to one cluster than the other, using a different strategy for those variables which do not satisfy the confidence level. In this paper we propose to assign the unsure variables to a third cluster. Experimental evaluation is presented in the context of a recently proposed hierarchical algorithm, assessing the advantages of the proposal and also revealing advantages in memory usage reduction and processing speed. Although this proposal is evaluated under the scope of an existing method, it can be generalized to any divisive procedure.
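
A minimal sketch of a confident assignment step is given below. It uses the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and sends a variable to a third cluster when the difference between its distances to the two new clusters does not exceed epsilon. The function names, the input layout and the example numbers are assumptions made for this illustration and are not taken from the paper.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound for a mean of n observations with range `value_range`."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def assign_variables(distances, value_range, delta, n):
    """Assign each variable to the first, second, or an 'unsure' third cluster.

    `distances` maps a variable name to its (hypothetical) observed distances
    to the two new clusters: (d_first, d_second).
    """
    eps = hoeffding_bound(value_range, delta, n)
    clusters = {"first": [], "second": [], "unsure": []}
    for name, (d_first, d_second) in distances.items():
        if d_second - d_first > eps:
            clusters["first"].append(name)   # confidently closer to the first
        elif d_first - d_second > eps:
            clusters["second"].append(name)  # confidently closer to the second
        else:
            clusters["unsure"].append(name)  # difference within the bound
    return clusters


example = {"a": (0.10, 0.60), "b": (0.70, 0.20), "c": (0.40, 0.45)}
print(assign_variables(example, value_range=1.0, delta=0.05, n=200))
```

With these illustrative values the bound is roughly 0.087, so variable `c`, whose two distances differ by only 0.05, ends up in the third cluster.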


2011

Data Mining Applied on Grain Data Mart

Authors
Correa, FE; Oliveira, MDB; Alves, LRA; Gama, J; Correa, PLP;

Publication
EFITA/WCCA '11

Abstract
Agribusiness, like many other activities, produces huge amounts of spatio-temporal data. We need a system to store, analyze, and mine this data. In a previous work, we developed data warehouse tools to store, organize and query Brazilian agribusiness data from several regions over 10 years. In this paper, we go a step further and propose specific data mining techniques to discover marks and evolution patterns from agribusiness data. We propose the use of Tucker decomposition to automatically detect short time windows that exhibit large changes in the correlation structure between the time series of prices from the Brazilian grain market.
