Publications - INESC TEC

Publications

Publications by Sérgio Nunes

2022

Designing User Interaction with Linked Data in Historical Archives

Authors
Guedes, C; Giesteira, B; Nunes, S;

Publication
ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE

Abstract
In this article, we present solutions to visualize and interact with linked data in historical archives considering three different scenarios: search, individual record view, and creation of relationships. The solutions were designed using nonfunctional mockups and were based on the CIDOC-CRM model, a model created and applied in the museum community that can be extended to other cultural heritage institutions; our solutions apply this model to archives. A sample of 20 archival professionals was selected to evaluate the elements included in the proposed solutions. The evaluation sessions consisted of structured interviews supported by an introductory video and a survey. The think-aloud protocol was applied during the sessions. We conducted both a quantitative and a qualitative analysis of the collected answers. From this analysis, we conclude that the majority of the participants showed great receptivity to the solutions presented and recognized many benefits in the application of linked data. Our contributions also include an exploratory study of some existing linked data systems, giving particular attention to their visualization and interaction modes.


2022

EPISA Platform: A Technical Infrastructure to Support Linked Data in Archival Management

Authors
Nunes, S; Silva, T; Martins, C; Peixoto, R;

Publication
Proceedings of the 26th International Conference on Theory and Practice of Digital Libraries - Workshops and Doctoral Consortium, Padua, Italy, September 20, 2022.

Abstract
In this paper we describe the EPISA Platform, a technical infrastructure designed and developed to support archival records management and access using linked data technologies. The EPISA Platform follows a client-server paradigm, with a central component, the EPISA Server, responsible for storage, reasoning, authorization, and search; and a frontend component, the EPISA ArchClient, responsible for user interaction. The EPISA Server uses Apache Jena Fuseki for storage and reasoning, and Apache Solr for search. The EPISA ArchClient is a web application implemented using PHP Laravel and standard web technologies. The platform follows a modular architecture, based on Docker containers. We describe the technical details of the platform and the main user interaction workflows, highlighting the abstractions developed to integrate linked data in the archival management process. The EPISA Platform has been successfully used to support research and development of linked data use in the archival domain in the context of the EPISA project. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)


2022

Federated Search Using Query Log Evidence

Authors
Damas, J; Devezas, J; Nunes, S;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2022

Abstract
In this work, we targeted the search engine of a sports-related website that presented an opportunity for search result quality improvement. We reframed the engine as a Federated Search instance, where each collection represented a searchable entity type within the system, using Apache Solr for querying each resource and a Python Flask server to merge results. We extend previous work on individual search term weighting, making use of past search terms as a relevance indicator for user-selected documents. To incorporate term weights, we define four strategies combining two binary variables: integration with default relevance (linear scaling or linear combination) and search term frequency (raw value or log-smoothed). To evaluate our solution, we extracted two query sets from search logs: one with frequently submitted queries, and another with ambiguous result access patterns. We used click-through information as a relevance proxy and tried to mitigate its limitations by evaluating under distinct IR metrics, including MRR, MAP and NDCG. Moreover, we also measured Spearman rank correlation coefficients to test similarities between produced rankings and reference orderings according to user access patterns. Results show consistency across all metrics in both sets. Previous search terms were key to obtaining a higher effectiveness, with runs that used pure search term frequency performing best. Compared to the baseline, our best strategies were able to maintain quality on frequent queries and improve retrieval effectiveness on ambiguous queries, with up to six percentage points better performance on most metrics.
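
The four strategies described in this abstract (two integration modes times two term-frequency variants) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the formulas, so the function names (`term_weight`, `rescore`), the `1 + w` scaling factor, and the mixing parameter `alpha` are all assumptions.

```python
import math


def term_weight(freq: int, log_smoothed: bool) -> float:
    """Turn a past-search-term frequency into a weight.

    Assumption: log smoothing is log(1 + freq); the raw variant uses the
    frequency unchanged.
    """
    return math.log1p(freq) if log_smoothed else float(freq)


def rescore(base_score: float, freq: int, *, mode: str,
            log_smoothed: bool, alpha: float = 0.5) -> float:
    """Combine the engine's default relevance score with a term weight.

    mode="scaling":     base_score * (1 + w)            (assumed form)
    mode="combination": alpha * base_score + (1 - alpha) * w  (assumed form)
    """
    w = term_weight(freq, log_smoothed)
    if mode == "scaling":
        return base_score * (1.0 + w)
    if mode == "combination":
        return alpha * base_score + (1.0 - alpha) * w
    raise ValueError(f"unknown mode: {mode!r}")


# The four strategies are the Cartesian product of the two binary choices.
STRATEGIES = [
    (mode, log_smoothed)
    for mode in ("scaling", "combination")
    for log_smoothed in (False, True)
]
```

For example, `rescore(2.0, 3, mode="scaling", log_smoothed=False)` multiplies the default score by `1 + 3`, while the `"combination"` mode with `alpha=0.5` averages the default score and the term weight.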


2026

Cross-Lingual Information Retrieval in Tetun for Ad-Hoc Search

Authors
Araújo, A; de Jesus, G; Nunes, S;

Publication
Lecture Notes in Computer Science

Abstract
Developing information retrieval (IR) systems that enable access across multiple languages is crucial in multilingual contexts. In Timor-Leste, where Tetun, Portuguese, English, and Indonesian are official and working languages, no cross-lingual information retrieval (CLIR) solutions currently exist to support information access across these languages. This study addresses that gap by investigating CLIR approaches tailored to the linguistic landscape of Timor-Leste. Leveraging an existing monolingual Tetun document collection and ad-hoc text retrieval baselines, we explore the feasibility of CLIR for Tetun. Queries were manually translated into Portuguese, English, and Indonesian to create a multilingual query set. These were then automatically translated back into Tetun using Google Translate and several large language models, and used to retrieve documents in Tetun. Results show that Google Translate is the most reliable tool for Tetun CLIR overall, and the Hiemstra LM consistently outperforms BM25 and DFR BM25 in cross-lingual retrieval performance. However, overall effectiveness remains up to 26.95 percentage points lower than that of the monolingual baseline, underscoring the limitations of current translation tools and the challenges of developing an effective CLIR for Tetun. Despite these challenges, this work establishes the first CLIR baseline for Tetun ad-hoc text retrieval, providing a foundation for future research in this under-resourced setting. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.


2026

User Behavior in Sports Search: Entity-Centric Query and Click Log Analysis

Authors
Damas, J; Nunes, S;

Publication
Lecture Notes in Computer Science

Abstract
Understanding user behavior in search systems is essential for improving retrieval effectiveness and user satisfaction. While prior research has extensively examined general-purpose web search engines, domain-specific contexts—such as sports information—remain comparatively underexplored. In this study, we analyze over 400,000 interaction log entries from a sports-oriented search engine collected over a two-week period. Our analysis combines classic query-level metrics (e.g., frequency distributions, query lengths) with a detailed examination of click behavior, including entropy-based intent variability and a custom query quality scoring model. Compared to established baselines from general and specialized search environments, we observe a high proportion of new and single-term queries, as well as a notable lack of representativeness among top queries. These findings reveal patterns shaped by the event-driven and entity-centric nature of sports content, offering actionable insights for the design of domain-specific retrieval systems. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
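
The entropy-based measure of intent variability mentioned in this abstract can be illustrated with a short sketch. It assumes that the measure is the Shannon entropy of the distribution of clicked results for a query; the abstract does not state the exact definition, and the function name and input format here are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def click_entropy(clicked_results: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the clicks recorded for one query.

    0.0 means every click went to the same result (a single clear intent);
    higher values mean clicks are spread over more results.
    """
    counts = Counter(clicked_results)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no clicks recorded: treat as no observed variability
    # sum of p * log2(1/p) over the clicked results
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A query whose clicks all land on one entity page scores 0.0, while a query whose clicks split evenly between two pages scores 1.0 bit.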


2025

Evaluating Dense Model-based Approaches for Multimodal Medical Case Retrieval

Authors
Catarina Pires; Sérgio Nunes; Luís Filipe Teixeira;

Publication
Information Retrieval Research

Abstract
Medical case retrieval plays a crucial role in clinical decision-making by enabling healthcare professionals to find relevant cases based on patient records, diagnostic images, and textual descriptions. Given the inherently multimodal nature of medical data, effective retrieval requires models that can bridge the gap between different modalities. Traditional retrieval approaches often rely on unimodal representations, limiting their ability to capture cross-modal relationships. Recent advances in dense model-based techniques have shown promise in overcoming these limitations by encoding multimodal information into a shared latent space, facilitating retrieval based on semantic similarity. This paper investigates the potential of dense models to enhance multimodal search systems. We evaluate various dense model-based approaches to assess which model characteristics have the greatest impact on retrieval effectiveness, using the medical case-based retrieval task from ImageCLEFmed 2013 as a benchmark. Our findings indicate that different dense model approaches substantially impact retrieval effectiveness, and that applying the CombMAX fusion method to combine their output results further improves effectiveness. Extending context length, however, yielded mixed results depending on the input data. Additionally, domain-specific models (those trained on medical data) outperformed general models trained on broad, non-specialized datasets within their respective fields. Furthermore, when text is the dominant information source, text-only models surpassed multimodal models.
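
CombMAX assigns each document the maximum score it receives across the runs being fused. The following is a minimal sketch of that rule, assuming each run is a dict mapping a document ID to a score that is already on a comparable scale; the abstract does not say whether or how scores were normalized, and `comb_max` and the sample runs are hypothetical.

```python
from typing import Dict, List, Tuple


def comb_max(runs: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Fuse several retrieval runs with CombMAX.

    Each document keeps the highest score found in any run; the fused
    ranking is sorted by that score in descending order.
    """
    fused: Dict[str, float] = {}
    for run in runs:
        for doc_id, score in run.items():
            if doc_id not in fused or score > fused[doc_id]:
                fused[doc_id] = score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from a text model and an image model.
text_run = {"case1": 0.2, "case2": 0.9}
image_run = {"case1": 0.7, "case3": 0.4}
ranking = comb_max([text_run, image_run])
# ranking == [("case2", 0.9), ("case1", 0.7), ("case3", 0.4)]
```

Because CombMAX compares raw scores across runs, a run with a systematically larger score range would dominate the fusion, which is why some form of score normalization is commonly applied first.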
