Publications - INESC TEC

Publications by José Luís Devezas

2019

Hypergraph-of-entity: A unified representation model for the retrieval of text and knowledge

Authors
Devezas, J; Nunes, S;

Publication
OPEN COMPUTER SCIENCE

Abstract
Modern search is heavily powered by knowledge bases, but users still query using keywords or natural language. As search becomes increasingly dependent on the integration of text and knowledge, novel approaches for a unified representation of combined data present the opportunity to unlock new ranking strategies. We have previously proposed the graph-of-entity as a purely graph-based representation and retrieval model; however, this model scales poorly. We tackle the scalability issue by adapting the model so that it can be represented as a hypergraph. This enables a significant reduction in the number of (hyper)edges relative to the number of nodes, while capturing nearly the same amount of information. Moreover, such a higher-order data structure can capture richer types of relations, including n-ary connections such as synonymy or subsumption. We present the hypergraph-of-entity as the next step in the graph-of-entity model, where we explore a ranking approach based on biased random walks. We evaluate the approaches using a subset of the INEX 2009 Wikipedia Collection. While performance is still below the state of the art, we were, in part, able to achieve a MAP score similar to TF-IDF and greatly improve indexing efficiency over the graph-of-entity.
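To make the idea of ranking by random walks over a hypergraph more concrete, the following is a minimal sketch, not the authors' implementation. It assumes undirected hyperedges stored as sets of node labels, starts walks at seed nodes matched by a query, and ranks nodes by visit frequency. The function name `random_walk_ranking`, its parameters, and the toy data are hypothetical, and the bias described in the abstract is not modeled here (all choices are uniform).

```python
import random
from collections import Counter


def random_walk_ranking(hyperedges, seeds, walk_length=3, walks_per_seed=200, rng=None):
    """Rank nodes by how often short random walks from the seed nodes visit them.

    hyperedges: dict mapping a hyperedge id to a set of node labels (assumed layout).
    seeds: iterable of node labels where walks start (e.g. query terms).
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible

    # Incidence index: node -> list of hyperedge ids that contain it.
    incident = {}
    for edge_id, members in hyperedges.items():
        for node in members:
            incident.setdefault(node, []).append(edge_id)

    visits = Counter()
    for seed in seeds:
        if seed not in incident:
            continue  # seed does not occur in the hypergraph
        for _ in range(walks_per_seed):
            node = seed
            for _ in range(walk_length):
                # One step: pick an incident hyperedge, then one of its member nodes.
                edge_id = rng.choice(incident[node])
                node = rng.choice(sorted(hyperedges[edge_id]))
                visits[node] += 1
    return visits.most_common()


# Toy hypergraph: two document hyperedges and one entity-relation hyperedge.
toy_hyperedges = {
    "doc1": {"term:search", "term:graph", "entity:Wikipedia"},
    "doc2": {"term:graph", "entity:Hypergraph"},
    "related": {"entity:Wikipedia", "entity:Hypergraph"},
}
for node, count in random_walk_ranking(toy_hyperedges, seeds=["term:graph"]):
    print(node, count)
```

A single hyperedge such as `doc1` links three nodes at once, whereas a plain graph would need one edge per pair; this is the kind of edge reduction the abstract refers to.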


2019

Graph-of-Entity: A Model for Combined Data Representation and Retrieval

Authors
Devezas, JL; Lopes, CT; Nunes, S;

Publication
8th Symposium on Languages, Applications and Technologies, SLATE 2019, June 27-28, 2019, Coimbra, Portugal.

Abstract
Managing large volumes of digital documents along with the information they contain, or are associated with, can be challenging. As systems become more intelligent, it increasingly makes sense to power retrieval through all available data, where every lead makes it easier to reach relevant documents or entities. Modern search is heavily powered by structured knowledge, but users still query using keywords or, at the very best, telegraphic natural language. As search becomes increasingly dependent on the integration of text and knowledge, novel approaches for a unified representation of combined data present the opportunity to unlock new ranking strategies. We tackle entity-oriented search using graph-based approaches for representation and retrieval. In particular, we propose the graph-of-entity, a novel approach for indexing combined data, where terms, entities and their relations are jointly represented. We compare the graph-of-entity with the graph-of-word, a text-only model, verifying that, overall, it does not yet achieve better performance, despite obtaining higher precision. Our assessment was based on a small subset of the INEX 2009 Wikipedia Collection, created from a sample of 10 topics and their respective judged documents. The offline evaluation we do here is complementary to its counterpart from the TREC 2017 OpenSearch track, where, during our participation, we assessed the graph-of-entity in an online setting, through team-draft interleaving. © José Devezas, Carla Lopes, and Sérgio Nunes.


2020

Army ANT: A Workbench for Innovation in Entity-Oriented Search

Authors
Devezas, JL; Nunes, S;

Publication
Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part II

Abstract
As entity-oriented search takes the lead in modern search, the need for increasingly flexible tools, capable of motivating innovation in information retrieval research, also becomes more evident. Army ANT is an open source framework that takes a step forward in generalizing information retrieval research, so that modern approaches can be easily integrated in a shared evaluation environment. We present an overview of the system architecture of Army ANT, which has four main abstractions: (i) readers, to iterate over text collections, potentially containing associated entities and triples; (ii) engines, which implement indexing and searching approaches, supporting different retrieval tasks and ranking functions; (iii) databases, to store additional document metadata; and (iv) evaluators, to assess retrieval performance for specific tasks and test collections. We also introduce the command line interface and the web interface, presenting a learn mode as a way to explore, analyze and understand representation and retrieval models, through tracing, score component visualization and documentation. © Springer Nature Switzerland AG 2020.
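The four abstractions listed in the abstract can be pictured with a small sketch. This is purely illustrative and is not Army ANT's actual API: every class, method, and function name below (`Document`, `ListReader`, `TermOverlapEngine`, `precision_at_k`) is hypothetical, the "database" is reduced to a plain dictionary, and the engine uses a toy term-overlap score.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical document record yielded by a reader."""
    doc_id: str
    text: str
    entities: list = field(default_factory=list)


class ListReader:
    """Reader: iterates over a collection (here, an in-memory list)."""
    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        return iter(self.documents)


class TermOverlapEngine:
    """Engine: indexes documents and answers keyword queries (toy scoring)."""
    def __init__(self):
        self._terms = {}

    def index(self, reader):
        for doc in reader:
            self._terms[doc.doc_id] = set(doc.text.lower().split())

    def search(self, query, limit=10):
        query_terms = set(query.lower().split())
        scored = [(len(query_terms & terms), doc_id)
                  for doc_id, terms in self._terms.items()]
        # Keep matching documents only; sort by score, then id for stable output.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [doc_id for _, doc_id in ranked[:limit]]


def precision_at_k(results, relevant, k):
    """Evaluator: fraction of the top-k results that are judged relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in results[:k] if doc_id in relevant) / k


docs = [
    Document("d1", "graph based entity search", ["Graph"]),
    Document("d2", "cooking recipes for beginners"),
]
metadata_db = {d.doc_id: {"entities": d.entities} for d in docs}  # stand-in "database"

engine = TermOverlapEngine()
engine.index(ListReader(docs))
results = engine.search("entity search")
print(results, precision_at_k(results, relevant={"d1"}, k=1))
```

Keeping the reader, engine, and evaluator behind separate interfaces is what lets different retrieval models be swapped in and compared on the same collection and judgments.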


2019

Characterizing the Hypergraph-of-Entity Representation Model

Authors
Devezas, JL; Nunes, S;

Publication
Complex Networks and Their Applications VIII - Volume 2: Proceedings of the Eighth International Conference on Complex Networks and Their Applications, COMPLEX NETWORKS 2019, Lisbon, Portugal, December 10-12, 2019.

Abstract
The hypergraph-of-entity is a joint representation model for terms, entities and their relations, used as an indexing approach in entity-oriented search. In this work, we characterize the structure of the hypergraph at both microscopic and macroscopic scales, as well as over time with an increasing number of documents. We use a random-walk-based approach to estimate shortest distances and node sampling to estimate clustering coefficients. We also propose the calculation of a general mixed hypergraph density based on the corresponding bipartite mixed graph. We analyze these statistics for the hypergraph-of-entity, finding that hyperedge-based node degrees are distributed as a power law, while node-based node degrees and hyperedge cardinalities are log-normally distributed. We also find that most statistics tend to converge after an initial period of accentuated growth in the number of documents. © 2020, Springer Nature Switzerland AG.
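As an illustration of deriving a density from the bipartite view of a hypergraph, here is a simplified sketch. It is not the mixed-hypergraph formula proposed in the paper: it assumes undirected hyperedges only, builds the bipartite graph whose vertices are the nodes plus the hyperedges and whose edges are node-hyperedge incidences, and applies the standard undirected graph density 2m / (n(n - 1)). The function name `bipartite_density` is hypothetical.

```python
def bipartite_density(hyperedges):
    """Density of the bipartite incidence graph of an undirected hypergraph.

    hyperedges: dict mapping a hyperedge id to an iterable of node labels (assumed layout).
    """
    member_sets = [set(members) for members in hyperedges.values()]
    nodes = set().union(*member_sets) if member_sets else set()

    n = len(nodes) + len(member_sets)        # bipartite vertices: nodes + hyperedges
    m = sum(len(s) for s in member_sets)     # bipartite edges: one per incidence
    if n < 2:
        return 0.0
    return 2 * m / (n * (n - 1))


# 3 nodes + 2 hyperedges = 5 bipartite vertices; 3 + 2 = 5 incidences.
print(bipartite_density({"e1": {"a", "b", "c"}, "e2": {"b", "c"}}))  # 2*5 / (5*4) = 0.5
```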


2012

Creating News Context From a Folksonomy of Web Clipping

Authors
Devezas, J; Alves, H; Figueira, A;

Publication
INTERNATIONAL MULTICONFERENCE OF ENGINEERS AND COMPUTER SCIENTISTS, IMECS 2012, VOL I

Abstract
We propose a method for creating news context by taking advantage of a folksonomy of web clipping based on online news. We experiment with an ontology-based named entity recognition process and study two different ways of modeling the relationships induced by the coreference of named entities on news clips. We try to establish a context by identifying the community structure for a clip-centric network and for an entity-centric network, based on a small test set from the Breadcrumbs system. Finally, we compare both models, based on the detected news communities, and show the advantages of each network specification.


2012

Using the overlapping community structure of a network of tags to improve text clustering

Authors
Cravino, N; Devezas, JL; Figueira, A;

Publication
23rd ACM Conference on Hypertext and Social Media, HT '12, Milwaukee, WI, USA, June 25-28, 2012

Abstract
Breadcrumbs is a folksonomy of news clips, where users can aggregate fragments of text taken from online news. Besides the textual content, each news clip contains a set of metadata fields associated with it. User-defined tags are one of the most important of those information fields. Based on a small data set of news clips, we build a network of co-occurrence of tags in news clips, and use it to improve text clustering. We do this by defining a weighted cosine similarity proximity measure that takes into account both the clip vectors and the tag vectors. The tag weight is computed using the related tags that are present in the discovered community. We then use the resulting vectors together with the new distance metric, which allows us to identify socially biased document clusters. Our study indicates that using the structural features of the network of tags has a positive impact on the clustering process. Copyright 2012 ACM.
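One simple way to picture a similarity measure that accounts for both clip vectors and tag vectors is a convex combination of two cosine similarities. The sketch below is an assumption for illustration, not the measure defined in the paper: the function names and the fixed `tag_weight` parameter are hypothetical, whereas the paper derives the tag weight from related tags in the discovered community.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_a * norm_b)


def combined_similarity(clip_a, clip_b, tags_a, tags_b, tag_weight=0.5):
    """Blend text similarity and tag similarity; tag_weight must lie in [0, 1]."""
    if not 0.0 <= tag_weight <= 1.0:
        raise ValueError("tag_weight must be between 0 and 1")
    text_sim = cosine(clip_a, clip_b)
    tag_sim = cosine(tags_a, tags_b)
    return (1.0 - tag_weight) * text_sim + tag_weight * tag_sim


# Two clips with unrelated text but identical tags still receive a nonzero similarity.
print(combined_similarity([1, 0], [0, 1], [1, 1], [1, 1], tag_weight=0.25))
```

A clustering algorithm would then typically use `1 - combined_similarity(...)` as the distance between two clips.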
