
Publications by Sérgio Nunes

2016

Exploring a Large News Collection Using Visualization Tools

Authors
Devezas, T; Devezas, JL; Nunes, S;

Publication
Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with 38th European Conference on Information Retrieval (ECIR 2016), Padua, Italy, March 20, 2016.

Abstract
The overwhelming amount of news content published online every day has made it increasingly difficult to perform macro-level analysis of the news landscape. Visual exploration tools harness both computing power and human perception to assist in making sense of large data collections. In this paper, we employed three visualization tools to explore a dataset comprising one million articles published by news organizations and blogs. The visual analysis of the dataset revealed that 1) news and blog sources judge the importance of similar events very differently, granting them distinct amounts of coverage, 2) there are both dissimilarities and overlaps in the publication patterns of the two source types, and 3) the content's direction and diversity behave differently over time.
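As a rough illustration of the macro-level analysis described above (not the paper's actual tooling), the sketch below counts how much coverage each source type gives the same topic per day; the records, field names and topics are invented for the example.

```python
# Minimal sketch (not the paper's visualization tools): one macro-level
# view of a news/blog collection, counting how much coverage each source
# type grants the same topic over time. The records below are invented;
# the real dataset comprises one million articles.
from collections import Counter

articles = [
    {"date": "2016-03-01", "source_type": "news", "topic": "elections"},
    {"date": "2016-03-01", "source_type": "blog", "topic": "elections"},
    {"date": "2016-03-01", "source_type": "news", "topic": "elections"},
    {"date": "2016-03-02", "source_type": "blog", "topic": "sports"},
]

# Aggregate article counts per (day, source type, topic) triple, the
# kind of table a coverage-over-time visualization would be drawn from.
coverage = Counter((a["date"], a["source_type"], a["topic"])
                   for a in articles)
for (date, source_type, topic), n in sorted(coverage.items()):
    print(f"{date}  {source_type:<4}  {topic:<10}  {n} article(s)")
```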

2016

Predicting the comprehension of health web documents using characteristics of documents and users

Authors
Oroszlanyova, M; Lopes, CT; Nunes, S; Ribeiro, C;

Publication
INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS/INTERNATIONAL CONFERENCE ON PROJECT MANAGEMENT/INTERNATIONAL CONFERENCE ON HEALTH AND SOCIAL CARE INFORMATION SYSTEMS AND TECHNOLOGIES, CENTERIS/PROJMAN / HCIST 2016

Abstract
The Web is frequently used as a way to access health information. In the health domain, the terminology can be very specific, frequently assuming a medico-scientific character. This can be a barrier to users, who may be unable to understand the retrieved documents. It would therefore be useful to automatically assess how well a certain document will be understood by a certain user. In the present work, we analyse whether it is possible to predict the comprehension of documents using document features together with user features, and how well this can be achieved. We use an existing dataset, composed of health documents from the Web and their assessment in terms of comprehension by users, to build two multivariate prediction models for comprehension. Our best model showed very good results, with 96.51% accuracy. Our findings suggest features that could be considered by search engines to estimate comprehension. We found that user characteristics related to web and health search habits, such as users' success with web search and the frequency of their health searches, are among the most influential user variables. The promising results obtained with this manually assessed dataset will lead us to explore the automatic assessment of document and user characteristics.
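A minimal sketch of the kind of multivariate prediction described above, not the authors' actual model: a classifier over combined document and user features, using scikit-learn. All feature names, the synthetic data and the labelling rule are invented for illustration.

```python
# Hedged sketch: predicting binary document comprehension from combined
# document and user features, in the spirit of the paper's multivariate
# models. Features and data below are invented, not the paper's dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500

# Hypothetical features: document readability score, fraction of
# medico-scientific terms, user's web-search success rate, and how
# often the user searches for health topics (per month).
X = np.column_stack([
    rng.uniform(0, 100, n),   # readability score
    rng.uniform(0, 1, n),     # fraction of technical terminology
    rng.uniform(0, 1, n),     # user web-search success rate
    rng.integers(0, 30, n),   # health searches per month
])

# Toy label: comprehension is easier for readable documents with little
# jargon, read by users who are successful web searchers.
y = ((X[:, 0] / 100 - X[:, 1] + X[:, 2]) > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"cross-validated accuracy: {scores.mean():.2%}")
```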

2015

Summarization of changes in dynamic text collections using Latent Dirichlet Allocation model

Authors
Kar, M; Nunes, S; Ribeiro, C;

Publication
INFORMATION PROCESSING & MANAGEMENT

Abstract
In the area of Information Retrieval, the task of automatic text summarization usually assumes a static underlying collection of documents, disregarding the temporal dimension of each document. However, in real world settings, collections and individual documents rarely stay unchanged over time. The World Wide Web is a prime example of a collection where information changes both frequently and significantly over time, with documents being added, modified or just deleted at different times. In this context, previous work addressing the summarization of web documents has simply discarded the dynamic nature of the web, considering only the latest published version of each individual document. This paper proposes and addresses a new challenge - the automatic summarization of changes in dynamic text collections. In standard text summarization, retrieval techniques present a summary to the user by capturing the major points expressed in the most recent version of an entire document in a condensed form. In this new task, the goal is to obtain a summary that describes the most significant changes made to a document during a given period. In other words, the idea is to have a summary of the revisions made to a document over a specific period of time. This paper proposes different approaches to generate summaries using extractive summarization techniques. First, individual terms are scored and then this information is used to rank and select sentences to produce the final summary. A system based on the Latent Dirichlet Allocation (LDA) model is used to find the hidden topic structures of changes. The purpose of using the LDA model is to identify separate topics where the changed terms from each topic are likely to carry at least one significant change. The different approaches are then compared with the previous work in this area. A collection of articles from Wikipedia, including their revision history, is used to evaluate the proposed system. For each article, a temporal interval and a reference summary from the article's content are selected manually. The articles and intervals in which a significant event occurred are carefully selected. The summaries produced by each of the approaches are evaluated against the manual summaries using ROUGE metrics. It is observed that the approach using the LDA model outperforms all the other approaches. Statistical tests reveal that the differences in ROUGE scores for the LDA-based approach are statistically significant at the 99% level over the baseline.
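The following sketch illustrates the general idea under stated assumptions; it is not the paper's system. Terms added between two revisions are modelled with LDA (here scikit-learn's implementation), and changed sentences are ranked by the topic weight of the terms they contain. The revision texts are invented.

```python
# Hedged sketch: extractive summarization of the *changes* between two
# revisions of a document. Newly added terms are modelled with LDA, and
# sentences are ranked by the topic weight of the changed terms they
# contain. All texts here are invented for illustration.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

old_rev = "The festival takes place every summer. Tickets are sold online."
new_rev = ("The festival takes place every summer. Tickets are sold online. "
           "The 2015 edition was cancelled after severe flooding. "
           "Organizers announced a relocated venue for the next edition.")

sentences = re.split(r"(?<=[.!?])\s+", new_rev)
old_terms = set(re.findall(r"\w+", old_rev.lower()))

# Keep only sentences that contain changed (newly added) terms.
changed = [s for s in sentences
           if set(re.findall(r"\w+", s.lower())) - old_terms]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(changed)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Score each changed sentence by the summed LDA weight of its terms in
# the sentence's dominant topic, then select the best-scoring sentence.
def score(i):
    topic = lda.transform(X[i]).argmax()
    return X[i].toarray()[0] @ lda.components_[topic]

best = max(range(len(changed)), key=score)
print("change summary:", changed[best])
```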

2017

Predicting the Situational Relevance of Health Web Documents

Authors
Oroszlanyova, M; Lopes, CT; Nunes, S; Ribeiro, C;

Publication
2017 12TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI)

Abstract
Relevance is usually estimated by search engines using document content, disregarding the user behind the search and the characteristics of the task. In this work, we look at relevance as framed in a situational context, calling it situational relevance, and analyze whether it is possible to predict it using document, user, and task characteristics. Using an existing dataset composed of health web documents, relevance judgments for information needs, and user and task characteristics, we build a multivariate prediction model for situational relevance. Our model has an accuracy of 77.17%. Our findings provide insights into features that could improve the estimation of relevance by search engines, helping to reconcile the systemic and situational views of relevance. In the near future, we will work on the automatic assessment of document, user and task characteristics.
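As a hedged sketch of the idea, not the authors' model: a logistic regression that combines a content-only (systemic) score with user and task features to predict a situational relevance judgment. The feature names, synthetic data and labelling rule below are invented.

```python
# Hedged sketch: combining a content-based relevance score with user and
# task features in one logistic regression, illustrating the notion of
# situational relevance. All names and data are invented for the example.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)
n = 400
bm25 = rng.uniform(0, 10, n)              # systemic, content-only score
user_expertise = rng.uniform(0, 1, n)     # user: familiarity with health topics
task_specificity = rng.uniform(0, 1, n)   # task: how focused the need is

# Toy judgment: a document is situationally relevant when the content
# matches AND it suits the user's expertise and the task at hand.
y = ((bm25 / 10 + user_expertise * task_specificity) > 0.8).astype(int)

X = np.column_stack([bm25, user_expertise, task_specificity])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print(f"accuracy: {accuracy_score(y_te, clf.predict(X_te)):.2%}")
```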

2017

Evaluation of Stanford NER for extraction of assembly information from instruction manuals

Authors
Costa, CM; Veiga, G; Sousa, A; Nunes, S;

Publication
2017 IEEE International Conference on Autonomous Robot Systems and Competitions, ICARSC 2017, Coimbra, Portugal, April 26-28, 2017

Abstract
Teaching industrial robots by demonstration can significantly decrease the repurposing costs of assembly lines worldwide. To achieve this goal, the robot needs to detect and track each component with high accuracy. To speed up the initial object recognition phase, the learning system can gather information from assembly manuals in order to identify which parts and tools are required for assembling a new product (avoiding exhaustive search in a large model database) and, if possible, also extract the assembly order and the spatial relations between them. This paper presents a detailed analysis of the fine-tuning of the Stanford Named Entity Recognizer for this text tagging task. Starting from the recommended configuration, 91 tests were performed targeting the main features/parameters. Each test changed only a single parameter relative to the recommended configuration, and its goal was to measure the impact of the new configuration on the precision, recall and F1 metrics. This analysis allowed us to fine-tune the Stanford NER system, achieving a precision of 89.91%, recall of 83.51% and F1 of 84.69%. These results were obtained with our new manually annotated dataset containing text with assembly operations for alternators, gearboxes and engines, written in a discourse that ranges from professional to informal. The dataset can also be used to evaluate other information extraction and computer vision systems, since most assembly operations have pictures and diagrams showing the necessary product parts, their assembly order and relative spatial disposition.
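For the evaluation side only, a minimal sketch of entity-level precision, recall and F1 over (span, type) pairs, the metrics used above to compare configurations; this is not the Stanford NER API, and the tiny annotations are invented.

```python
# Hedged sketch: entity-level precision, recall and F1 computed over
# (span, type) pairs, as used to compare NER configurations. The gold
# and predicted annotations below are invented, not the paper's dataset.
def prf1(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                      # exact span+type matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Each entity is ((start_token, end_token), label), e.g. tools and
# product parts mentioned in an assembly instruction.
gold = [((0, 1), "TOOL"), ((4, 5), "PART"), ((9, 10), "PART")]
pred = [((0, 1), "TOOL"), ((4, 5), "PART"), ((7, 8), "PART")]

p, r, f = prf1(gold, pred)
print(f"precision={p:.2%} recall={r:.2%} F1={f:.2%}")
```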

2016

Index-Based Semantic Tagging for Efficient Query Interpretation

Authors
Devezas, J; Nunes, S;

Publication
EXPERIMENTAL IR MEETS MULTILINGUALITY, MULTIMODALITY, AND INTERACTION, CLEF 2016

Abstract
Modern search engines are evolving beyond ad hoc document retrieval. Nowadays, the information needs of users can be directly satisfied through entity-oriented search, by ranking the entities or attributes that best relate to the query, as opposed to the documents that contain the best matching terms. One of the challenges in entity-oriented search is efficient query interpretation. In particular, the task of semantic tagging, the identification of entity types in query parts, is central to understanding user intent. We compare two approaches for semantic tagging within a single domain, one based on a Sesame triple store and the other based on a Lucene index. This provides a segmentation and annotation of the query based on the most probable entity types, leading to query classification and its subsequent interpretation. We evaluate the run-time performance of the two strategies and find that there is a statistically significant speedup, of at least four times, for the index-based strategy over the triple store strategy.
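A minimal sketch of index-based semantic tagging, assuming a simple in-memory dictionary in place of the paper's Lucene index: surface forms map to entity types, and the query is segmented greedily, longest span first. The toy index entries are invented.

```python
# Hedged sketch (not the paper's Lucene implementation): an in-memory
# surface-form index mapping entity names to types, used to segment a
# query and tag its parts with the most likely entity type.
index = {
    "norah jones": "artist",
    "come away with me": "album",
    "blue note": "label",
}
max_len = max(len(k.split()) for k in index)

def tag(query):
    tokens = query.lower().split()
    tags, i = [], 0
    while i < len(tokens):
        # Greedily try the longest span first, as a phrase lookup over
        # indexed surface forms would.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in index:
                tags.append((span, index[span]))
                i += n
                break
        else:
            tags.append((tokens[i], None))  # untyped query part
            i += 1
    return tags

print(tag("albums by Norah Jones on Blue Note"))
```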
