Publications

Publications by Gabriel David

2010

Term Frequency Dynamics in Collaborative Articles

Authors
Nunes, S; Ribeiro, C; David, G;

Publication
DOCENG2010: PROCEEDINGS OF THE 2010 ACM SYMPOSIUM ON DOCUMENT ENGINEERING

Abstract
Documents on the World Wide Web are dynamic entities. Mainstream information retrieval systems and techniques are primarily focused on the latest version of a document, generally ignoring its evolution over time. In this work, we study the term frequency dynamics in web documents over their lifespan. We use Wikipedia as a document collection because it is a broad and public resource and, more importantly, because it provides access to the complete revision history of each document. We investigate the progression of similarity values over two projection variables, namely revision order and revision date. Based on this investigation, we find that term frequency in encyclopedic documents (i.e., those that are comprehensive and focused on a single topic) exhibits a rapid and steady progression towards the document's current version. The content in early versions quickly becomes very similar to the present version of the document.

2007

Using neighbors to date web documents

Authors
Nunes, S; Ribeiro, C; David, G;

Publication
International Conference on Information and Knowledge Management, Proceedings

Abstract
Time has been successfully used as a feature in web information retrieval tasks. In this context, estimating a document's inception date or last update date is a necessary task. Classic approaches have used HTTP header fields to estimate a document's last update time. The main problem with this approach is that it is applicable to only a small part of web documents. In this work, we evaluate an alternative strategy based on a document's neighborhood. Using a random sample containing 10,000 URLs from the Yahoo! Directory, we study each document's links and media assets to determine its age. If we only consider isolated documents, we are able to date 52% of them. Including the document's neighborhood, we are able to estimate the date of more than 86% of the same sample. Also, we find that estimates differ significantly according to the type of neighbors used. The most reliable estimates are based on the document's media assets, while the worst estimates are based on incoming links. These results are experimentally evaluated with a real world application using different datasets. Copyright 2007 ACM.

2007

An evaluation framework for multidimensional multimedia Descriptor indexing

Authors
Gonçalves, B; Calistru, C; Ribeiro, C; David, G;

Publication
2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOP, VOLS 1-2

Abstract
Automatic multimedia retrieval requires the use of complex features, which are typically captured by multidimensional descriptors. A basic operation in a multimedia retrieval system is similarity computation, making use of descriptor-dependent metrics. Many data structures have been proposed for managing the representation of multidimensional descriptors, each geared towards efficiency in some set of basic operations. The paper describes a framework for evaluating multidimensional descriptor indexing structures and reports a set of experiments with selected descriptor indexing methods. The extensibility of the framework is illustrated by incorporating a recently proposed structure, the BitMatrix. Data sets and experiment conditions can be set up so as to provide results that can be used in the choice of appropriate indexing structures for a class of multimedia retrieval applications.

2012

SIARD Archive Browser

Authors
Rahman, AU; David, G; Ribeiro, C;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
SIARD Suite enables us to preserve a relational database in an open format. It migrates a relational database to SIARD format and preserves technical and contextual metadata along with the primary data, ensuring long-term accessibility. This paper introduces a web application, the SIARD Archive Browser, which allows operations on the archive such as searching for a specific record, counting records in a table containing a keyword, sorting by a column and making joins. In many use cases, the application avoids the need to load a preserved database into a DBMS. © 2012 Springer-Verlag.

2010

Model Migration Approach for Database Preservation

Authors
Rahman, AU; David, G; Ribeiro, C;

Publication
ROLE OF DIGITAL LIBRARIES IN A TIME OF GLOBAL CHANGE

Abstract
Strategies developed for database preservation in the past include technology preservation, migration, emulation and the use of a universal virtual computer. In this paper we present a new concept of "Model Migration for Database Preservation". Our proposed approach involves two major activities: first, migrating the database model from the conventional relational model to a dimensional model and, second, calculating the information embedded in code and preserving it instead of preserving the code required to calculate it. This will affect the originality of the database but improve two other characteristics: the information considered relevant is kept in a simpler and easier-to-understand format, and the systematic process to preserve the dimensional model is independent of the DBMS details and application logic.

2008

Use of temporal expressions in web search

Authors
Nunes, S; Ribeiro, C; David, G;

Publication
ADVANCES IN INFORMATION RETRIEVAL

Abstract
In efforts to understand and characterize users' behavior online, the temporal dimension has received little attention from the research community. This exploratory study uses two collections of web search queries to investigate how temporal information needs are expressed. Using state-of-the-art information extraction techniques, we identify temporal expressions in these queries. We find that temporal expressions are rarely used (1.5% of queries) and, when used, they are related to current and past events. Also, there are specific topics where the use of temporal expressions is more visible.
