Details

  • Name

    Hugo Oliveira Sousa
  • Role

    External Student
  • Since

    07th December 2020
Publications

2025

Don't Forget This: Augmenting Results with Event-Aware Search

Authors
Sousa, H; Ward, AR; Alonso, O;

Publication
PROCEEDINGS OF THE EIGHTEENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM 2025

Abstract
Events like Valentine's Day and Christmas can influence user intent when interacting with search engines. For example, a user searching for a gift around Valentine's Day is likely to be looking for Valentine's-themed options, whereas the same query close to Christmas would more likely suggest an interest in holiday-themed gifts. These shifts in user intent, driven by temporal factors, are often implicit but important for determining the relevance of search results. In this demo, we explore how incorporating temporal awareness can enhance search relevance in an e-commerce setting. We constructed a database of 2K events and, using historical purchase data, developed a temporal model that estimates each event's importance on a specific date. The most relevant events on the date the query was issued are then used to enrich search results with event-specific items. Our demo illustrates how this approach enables a search system to better adapt to temporal nuances, ultimately delivering more contextually relevant products.
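
The idea in this abstract (score each event for the query date, then enrich the query with the top events) can be illustrated with a minimal sketch. This is not the authors' model: the paper learns event importance from historical purchase data, whereas the sketch below swaps in a simple Gaussian proximity kernel. The names `Event`, `spread_days`, `importance`, `top_events`, and `enrich_query`, as well as the threshold and the two sample events, are all hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    name: str
    month: int
    day: int
    spread_days: float  # hypothetical: how many days around the event it stays relevant


def day_distance(event: Event, on: date) -> int:
    """Distance in days between `on` and the nearest yearly occurrence of the event."""
    # Note: an event dated 29 February is not handled by this sketch.
    candidates = [date(on.year + offset, event.month, event.day) for offset in (-1, 0, 1)]
    return min(abs((c - on).days) for c in candidates)


def importance(event: Event, on: date) -> float:
    """Gaussian proximity score in (0, 1]; an assumed stand-in for a learned temporal model."""
    d = day_distance(event, on)
    return math.exp(-0.5 * (d / event.spread_days) ** 2)


def top_events(events, on, k=1, threshold=0.1):
    """Return up to k events with the highest score on `on`, dropping weak ones."""
    scored = sorted(((importance(e, on), e) for e in events), key=lambda p: p[0], reverse=True)
    return [e for score, e in scored[:k] if score >= threshold]


def enrich_query(query, events, on, k=1):
    """Keep the original query and add one event-themed variant per relevant event."""
    return [query] + [f"{e.name} {query}" for e in top_events(events, on, k)]


# Illustrative data only; the demo described above uses a database of about 2K events.
EVENTS = [
    Event("Valentine's Day", 2, 14, 7.0),
    Event("Christmas", 12, 25, 14.0),
]

print(enrich_query("gift", EVENTS, date(2025, 2, 10)))
print(enrich_query("gift", EVENTS, date(2025, 12, 20)))
print(enrich_query("gift", EVENTS, date(2025, 7, 1)))
```

With these assumed parameters, the same query yields a Valentine's-themed variant in early February, a Christmas-themed variant in late December, and no variant in July, when neither event passes the threshold.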

2024

Text2Story Lusa: A Dataset for Narrative Analysis in European Portuguese News Articles

Authors
Nunes, S; Jorge, AM; Amorim, E; Sousa, HO; Leal, A; Silvano, PM; Cantante, I; Campos, R;

Publication
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy.

Abstract
Narratives have been the subject of extensive research across various scientific fields such as linguistics and computer science. However, the scarcity of freely available datasets, essential for studying this genre, remains a significant obstacle. Furthermore, datasets annotated with narrative components and their morphosyntactic and semantic information are even scarcer. To address this gap, we developed the Text2Story Lusa datasets, which consist of a collection of news articles in European Portuguese. The first dataset consists of 357 news articles and the second dataset comprises a subset of 117 manually densely annotated articles, totaling over 50 thousand individual annotations. By focusing on texts with substantial narrative elements, we aim to provide a valuable resource for studying narrative structures in European Portuguese news articles. On the one hand, the first dataset provides researchers with data to study narratives from various perspectives. On the other hand, the annotated dataset facilitates research in information extraction and related tasks, particularly in the context of narrative extraction pipelines. Both datasets are made available adhering to FAIR principles, thereby enhancing their utility within the research community.

2024

Physio: An LLM-Based Physiotherapy Advisor

Authors
Almeida, R; Sousa, H; Cunha, LF; Guimaraes, N; Campos, R; Jorge, A;

Publication
ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT V

Abstract
The capabilities of the most recent language models have increased the interest in integrating them into real-world applications. However, the fact that these models generate plausible, yet incorrect text poses a constraint when considering their use in several domains. Healthcare is a prime example of a domain where text-generative trustworthiness is a hard requirement to safeguard patient well-being. In this paper, we present Physio, a chat-based application for physical rehabilitation. Physio is capable of making an initial diagnosis while citing reliable health sources to support the information provided. Furthermore, drawing upon external knowledge databases, Physio can recommend rehabilitation exercises and over-the-counter medication for symptom relief. By combining these features, Physio can leverage the power of generative models for language processing while also conditioning its response on dependable and verifiable sources. A live demo of Physio is available at https://physio.inesctec.pt.

2023

GPT Struct Me: Probing GPT Models on Narrative Entity Extraction

Authors
Sousa, H; Guimaraes, N; Jorge, A; Campos, R;

Publication
2023 IEEE INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY, WI-IAT

Abstract
The importance of systems that can extract structured information from textual data becomes increasingly pronounced given the ever-increasing volume of text produced on a daily basis. Having a system that can effectively extract such information in an interoperable manner would be an asset for several domains, be it finance, health, or legal. Recent developments in natural language processing led to the production of powerful language models that can, to some degree, mimic human intelligence. Such effectiveness raises a pertinent question: Can these models be leveraged for the extraction of structured information? In this work, we address this question by evaluating the capabilities of two state-of-the-art language models - GPT-3 and GPT-3.5, commonly known as ChatGPT - in the extraction of narrative entities, namely events, participants, and temporal expressions. This study is conducted on the Text2Story Lusa dataset, a collection of 119 Portuguese news articles whose annotation framework includes a set of entity structures along with several tags and attribute values. We first select the best prompt template through an ablation study over prompt components that provide varying degrees of information on a subset of documents of the dataset. Subsequently, we use the best templates to evaluate the effectiveness of the models on the remaining documents. The results obtained indicate that GPT models are competitive with out-of-the-box baseline systems, presenting an all-in-one alternative for practitioners with limited resources. By studying the strengths and limitations of these models in the context of information extraction, we offer insights that can guide future improvements and avenues to explore in this field.

2023

TEI2GO: A Multilingual Approach for Fast Temporal Expression Identification

Authors
Sousa, H; Campos, R; Jorge, A;

Publication
PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023

Abstract
Temporal expression identification is crucial for understanding texts written in natural language. Although highly effective systems such as HeidelTime exist, their limited runtime performance hampers adoption in large-scale applications and production environments. In this paper, we introduce the TEI2GO models, matching HeidelTime's effectiveness but with significantly improved runtime, supporting six languages, and achieving state-of-the-art results in four of them. To train the TEI2GO models, we used a combination of manually annotated reference corpora and Professor HeidelTime, a comprehensive weakly labeled corpus of news texts annotated with HeidelTime that we developed for this purpose. This corpus comprises a total of 138,069 documents (over six languages) with 1,050,921 temporal expressions, the largest open-source annotated dataset for temporal expression identification to date. By describing how the models were produced, we aim to encourage the research community to further explore, refine, and extend the set of models to additional languages and domains. Code, annotations, and models are openly available for community exploration and use. The models are conveniently available on HuggingFace for seamless integration and application.