Publications

Publications by LIAAD

2007

Data stream processing

Authors
Gama, J; Rodrigues, PP;

Publication
Learning from Data Streams: Processing Techniques in Sensor Networks

Abstract
The rapid growth in information science and technology in general and the complexity and volume of data in particular have introduced new challenges for the research community. Many sources produce data continuously. Examples include sensor networks, wireless networks, radio frequency identification (RFID), customer click streams, telephone records, multimedia data, scientific data, sets of retail chain transactions, etc. These sources are called data streams. A data stream is an ordered sequence of instances that can be read only once or a small number of times using limited computing and storage capabilities. These sources of data are characterized by being open-ended, flowing at high speed, and generated by non-stationary distributions in dynamic environments. What distinguishes current data from earlier data is automatic data feeds. We do not just have people who are entering information into a computer. Instead, we have computers entering data into each other [25]. Nowadays there are applications in which the data are modeled best as transient data streams instead of as persistent tables. Examples of applications include network monitoring, user modeling in web applications, sensor networks in electrical networks, telecommunications data management, prediction in stock markets, monitoring radio frequency identification, etc. In these applications it is not feasible to load the arriving data into a traditional database management system (DBMS), and traditional DBMS are not designed to directly support the continuous queries required by these applications [3]. Carney et al. [6] pointed out the significant differences between databases that are passive repositories of data and databases that actually monitor applications and alert humans when abnormal activity is detected. In the former, only the current state of the data is relevant for analysis. Humans initiate queries, usually one-time, predefined queries.
In the latter, data come from external sources (e.g., sensors), and require processing historic data. For example, in monitoring activity, queries should run continuously. The answer to a continuous query is produced over time, reflecting the data seen so far. Moreover, if the process is not strictly stationary (as in most real-world applications), the target concept could gradually change over time. For example, the type of abnormal activity (e.g., attacks in TCP/IP networks, frauds in credit card transactions, etc.) changes over time. Organizations use decision support systems to identify potentially useful patterns in data. Data analysis is complex, interactive, and exploratory over very large volumes of historic data, eventually stored in distributed environments. The traditional pattern discovery process requires online ad-hoc queries, not previously defined, that are successively refined. Nowadays, given the current trends in decision support and data analysis, the computer plays a much more active role, by searching hypotheses, evaluating and suggesting patterns. Due to the exploratory nature of these queries, an exact answer may not be required. A user may prefer a fast approximate answer. Range queries and selectivity estimation (the proportion of tuples that satisfy a query) are two illustrative examples where fast but approximate answers are more useful than slow and exact ones. Sensor networks are distributed environments producing multiple streams of data. We can consider the network as a distributed database we are interested in querying and mining. In this chapter we review the main techniques used for querying and mining data streams that are of potential use in sensor networks. In Sect. 3.2 we refer to the data stream models and identify their main research challenges. Section 3.3 presents basic stream models. Section 3.4 presents basic stream algorithms for maintaining synopses over data streams.
Section 3.5 concludes the chapter and points out future directions for research. © 2007 Springer-Verlag Berlin Heidelberg.
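The abstract's point about fast approximate answers can be illustrated with a small sketch. It is not taken from the chapter: it assumes reservoir sampling as the synopsis, and the names `reservoir_sample` and `estimate_selectivity`, the synthetic readings, and the range predicate are all hypothetical choices made for illustration. The stream is read once, only `k` items are kept in memory, and the selectivity of a range query is estimated from that sample.

```python
import random
from typing import Callable, Iterable, List


def reservoir_sample(stream: Iterable[float], k: int, rng: random.Random) -> List[float]:
    """Keep a uniform random sample of at most k items from a stream read once."""
    sample: List[float] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def estimate_selectivity(sample: List[float], predicate: Callable[[float], bool]) -> float:
    """Approximate the proportion of stream items that satisfy the predicate."""
    if not sample:
        return 0.0
    return sum(1 for x in sample if predicate(x)) / len(sample)


# Hypothetical stream: 100,000 synthetic sensor readings, uniform in [0, 100).
data_rng = random.Random(0)
readings = (data_rng.uniform(0.0, 100.0) for _ in range(100_000))

# Synopsis of 1,000 items; the full stream is never stored.
sample = reservoir_sample(readings, k=1_000, rng=random.Random(1))

# Range query 20 <= x <= 40; the exact selectivity for this synthetic stream is about 0.2.
approx = estimate_selectivity(sample, lambda x: 20.0 <= x <= 40.0)
print(f"estimated selectivity: {approx:.3f}")
```

The memory cost depends only on `k`, not on the stream length, which is the trade-off the abstract describes: a small loss of accuracy in exchange for a bounded, single-pass computation.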


2007

Efficient and scalable induction of logic programs using a deductive database system

Authors
Ferreira, M; Fonseca, NA; Rocha, R; Soares, T;

Publication
Inductive Logic Programming

Abstract
A consequence of ILP systems being implemented in Prolog or using Prolog libraries is that, usually, these systems use a Prolog internal database to store and manipulate data. However, in real-world problems, the original data is rarely in Prolog format. In fact, the data is often kept in Relational Database Management Systems (RDBMS) and then converted to a format acceptable by the ILP system. Therefore, a more interesting approach is to link the ILP system to the RDBMS and manipulate the data without converting it. This scheme has the advantage of being more scalable since the whole data does not need to be loaded into memory by the ILP system. In this paper we study several approaches of coupling ILP systems with RDBMS systems and evaluate their impact on performance. We propose to use a Deductive Database (DDB) system to transparently translate the hypotheses to relational algebra expressions. The empirical evaluation performed shows that the execution time of ILP algorithms can be effectively reduced using a DDB and that the size of the problems can be increased due to a non-memory storage of the data.
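As a rough illustration of evaluating a hypothesis inside the database rather than loading the data into memory, the sketch below counts how many examples a single clause covers using one SQL query. It is not the paper's system: the clause is translated to SQL by hand, SQLite stands in for the RDBMS, and the tables `parent` and `example`, their contents, and the function `coverage` are hypothetical.

```python
import sqlite3

# Hypothetical background knowledge and labelled examples, kept in the database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (x TEXT, y TEXT);
CREATE TABLE example (x TEXT, z TEXT, positive INTEGER);
INSERT INTO parent VALUES ('ann','bob'), ('bob','cat'), ('bob','dan'), ('eve','fay');
INSERT INTO example VALUES ('ann','cat',1), ('ann','dan',1), ('eve','fay',0), ('ann','bob',0);
""")

# Hand-written SQL for the candidate clause
#   grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
# The body becomes a self-join on parent; each example is tested with EXISTS.
COVERAGE_SQL = """
SELECT e.positive, COUNT(*)
FROM example AS e
WHERE EXISTS (
    SELECT 1
    FROM parent AS p1
    JOIN parent AS p2 ON p1.y = p2.x
    WHERE p1.x = e.x AND p2.y = e.z
)
GROUP BY e.positive
"""


def coverage(connection: sqlite3.Connection) -> tuple:
    """Return (positives covered, negatives covered) for the clause above."""
    counts = dict(connection.execute(COVERAGE_SQL).fetchall())
    return counts.get(1, 0), counts.get(0, 0)


print(coverage(conn))  # (2, 0): both positive examples covered, no negatives
```

Only the two counts leave the database, so the size of the example and background tables is limited by the database engine rather than by the memory of the learning system.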


2007

Learning paraphrases from WNS corpora

Authors
Cordeiro, J; Dias, G; Brazdil, P;

Publication
Proceedings of the Twentieth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2007

Abstract
Paraphrase detection can be seen as the task of aligning sentences that convey the same information yet are written in different forms. Such resources are important to automatically learn text-to-text rewriting rules. In this paper, we present a new metric for unsupervised detection of paraphrases and apply it in the context of clustering of paraphrases. An exhaustive evaluation is conducted over a set of standard paraphrase corpora and real-world web news stories (WNS) corpora. The results are promising as they outperform state-of-the-art measures developed for similar tasks.


2007

A Metric for Paraphrase Detection

Authors
Cordeiro, J; Dias, G; Brazdil, P;

Publication
2007 International Multi-Conference on Computing in the Global Information Technology (ICCGI'07)

Abstract

2007

New Functions for Unsupervised Asymmetrical Paraphrase Detection

Authors
Cordeiro, J; Dias, G; Brazdil, P;

Publication
JSW

Abstract

2007

An iterative process for building learning curves and predicting relative performance of classifiers

Authors
Leite, R; Brazdil, P;

Publication
Progress in Artificial Intelligence, Proceedings

Abstract
This paper concerns the problem of predicting the relative performance of classification algorithms. Our approach requires that experiments are conducted on small samples. The information gathered is used to identify the nearest learning curve for which the sampling procedure was fully carried out. This allows the generation of a prediction regarding the relative performance of the algorithms. The method automatically establishes how many samples are needed and their sizes. This is done iteratively by taking into account the results of all previous experiments - both on other datasets and on the new dataset obtained so far. Experimental evaluation has shown that the method achieves better performance than previous approaches.
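The nearest-learning-curve idea can be sketched as follows. This is a simplified illustration, not the authors' method: it assumes a plain sum of absolute differences as the curve distance, it omits the iterative choice of sample sizes, and the dataset names, algorithm names, and accuracy values are invented.

```python
from typing import Dict, List, Tuple

# Algorithm name -> accuracy measured at increasing sample sizes.
Curves = Dict[str, List[float]]


def curve_distance(partial: Curves, full: Curves) -> float:
    """Sum of absolute accuracy differences over the points measured so far."""
    return sum(
        abs(a - b)
        for algo, points in partial.items()
        for a, b in zip(points, full[algo])
    )


def predict_better(partial: Curves, stored: Dict[str, Curves]) -> Tuple[str, str]:
    """Find the nearest fully sampled dataset and return (its name, predicted best algorithm)."""
    nearest = min(stored, key=lambda name: curve_distance(partial, stored[name]))
    # Use the last point of the nearest curves as the prediction for the full dataset.
    final = {algo: points[-1] for algo, points in stored[nearest].items()}
    best = max(final, key=lambda algo: final[algo])
    return nearest, best


# Hypothetical learning curves from datasets where sampling was fully carried out.
stored_curves: Dict[str, Curves] = {
    "dataset_a": {"tree": [0.60, 0.70, 0.78, 0.85], "bayes": [0.65, 0.70, 0.72, 0.73]},
    "dataset_b": {"tree": [0.50, 0.55, 0.58, 0.60], "bayes": [0.70, 0.75, 0.78, 0.80]},
}

# New dataset: only the two smallest samples have been evaluated so far.
partial_curves: Curves = {"tree": [0.58, 0.69], "bayes": [0.66, 0.71]}

result = predict_better(partial_curves, stored_curves)
print(result)  # ('dataset_a', 'tree')
```

Here the partial curves are closest to `dataset_a`, where the tree learner wins on the full data, so the tree learner is predicted to win on the new dataset even though it is behind on the small samples.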
