Publications

Publications by LIAAD

2025

Enhancing Portuguese Variety Identification with Cross-Domain Approaches

Authors
Sousa, H; Almeida, R; Silvano, P; Cantante, I; Campos, R; Jorge, A;

Publication
THIRTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, AAAI-25, VOL 39 NO 24

Abstract
Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and study the effectiveness of transformer-based LVI classifiers for cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open source the code, corpus, and models to foster further research in this task.

2025

Tradutor: Building a Variety Specific Translation Model

Authors
Sousa, H; Almasian, S; Campos, R; Jorge, A;

Publication
THIRTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, AAAI-25, VOL 39 NO 24

Abstract
Language models have become foundational to many widely used systems. However, these seemingly advantageous models are double-edged swords. While they excel in tasks related to resource-rich languages like English, they often lose the fine nuances of language forms, dialects, and varieties that are inherent to languages spoken in multiple regions of the world. Languages like European Portuguese are neglected in favor of their more popular counterpart, Brazilian Portuguese, leading to suboptimal performance in various linguistic tasks. To address this gap, we introduce the first open-source translation model specifically tailored for European Portuguese, along with a novel dataset specifically designed for this task. Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties.

2025

Screening Urban Soil Contamination in Rome: Insights from XRF and Multivariate Analysis

Authors
Chandramohan, MS; da Silva, IM; Ribeiro, RP; Jorge, A; da Silva, JE;

Publication
ENVIRONMENTS

Abstract
This study investigates spatial distribution and chemical elemental composition screening in soils in Rome (Italy) using X-ray fluorescence analysis. Fifty-nine soil samples were collected from various locations within the urban areas of the Rome municipality and were analyzed for 19 elements. Multivariate statistical techniques, including nonlinear mapping, principal component analysis, and hierarchical cluster analysis, were employed to identify clusters of similar soil samples and their spatial distribution and to try to obtain environmental quality information. The soil sample clusters reflect the influence of natural geological processes and anthropogenic activities on soil contamination patterns. Spatial clustering using the k-means algorithm further identified six distinct clusters, each with specific geographical distributions and elemental characteristics. Hence, the findings underscore the importance of targeted soil assessments to ensure the sustainable use of land resources in urban areas.

2025

MedLink: Retrieval and Ranking of Case Reports to Assist Clinical Decision Making

Authors
Cunha, LF; Guimarães, N; Mendes, A; Campos, R; Jorge, A;

Publication
ECIR (5)

Abstract
In healthcare, diagnoses usually rely on physician expertise. However, complex cases may benefit from consulting similar past clinical case reports. In this paper, we present MedLink (http://medlink.inesctec.pt), a tool that, given a free-text medical report, retrieves and ranks relevant clinical case reports published in health conferences and journals, aiming to support clinical decision-making, particularly in challenging or complex diagnoses. To this end, we trained two BERT models on the sentence similarity task: a bi-encoder for retrieval and a cross-encoder for reranking. To evaluate our approach, we used 10 medical reports and asked a physician to rank the top 10 most relevant published case reports for each one. Our results show that MedLink’s ranking model achieved an NDCG@10 of 0.747. Our demo also includes the visualization of clinical entities (using a NER model) and the production of a textual explanation (using an LLM) to ease comparison and contrasting between reports.
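
For readers unfamiliar with the NDCG@10 metric reported above, the following is a minimal sketch of how it is commonly computed. It is not taken from the MedLink paper: the linear gain, the log2 discount, the function names, and the example relevance grades are all assumptions made for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items.

    Assumes linear gain (the relevance grade itself) and a log2 rank discount;
    other gain functions are also in common use.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query.

    `relevances` holds the graded relevance of each returned item, in the order
    the system ranked them. The ideal ranking is taken to be the same items
    sorted by decreasing relevance (an assumption of this sketch).
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(relevances, k) / ideal_dcg


# Hypothetical grades for one query (higher means more relevant).
system_ranking = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
print(round(ndcg_at_k(system_ranking, k=10), 3))
```

A score of 1.0 means the system ordered the items exactly as the ideal ranking would; the per-query scores are typically averaged over all queries.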

2025

NarratEX Dataset: Explaining the Dominant Narratives in News Texts

Authors
Guimarães, N; Silvano, P; Campos, R; Jorge, AM; Pacheco, AF; Dimitrov, DI; Nikolaidis, N; Yangarber, R; Sartori, E; Stefanovitch, N; Nakov, P; Piskorski, J; San Martino, GD;

Publication
EMNLP (Findings)

Abstract
We present NarratEX, a dataset designed for the task of explaining the choice of the Dominant Narrative in a news article, and intended to support the research community in addressing challenges such as discourse polarization and propaganda detection. Our dataset comprises 1,056 news articles in four languages, Bulgarian, English, Portuguese, and Russian, covering two globally significant topics: the Ukraine-Russia War (URW) and Climate Change (CC). Each article is manually annotated with a dominant narrative and sub-narrative labels, and an explanation justifying the chosen labels. We describe the dataset, the process of its creation, and its characteristics. We present experiments with two new proposed tasks: Explaining Dominant Narrative based on Text, which involves writing a concise paragraph to justify the choice of the dominant narrative and sub-narrative of a given text, and Inferring Dominant Narrative from Explanation, which involves predicting the appropriate dominant narrative category based on an explanatory text. The proposed dataset is a valuable resource for advancing research on detecting and mitigating manipulative content, while promoting a deeper understanding of how narratives influence public discourse.

2025

The incremental process of building an annotation scheme for clinical narratives in Portuguese: the contribution of human variation analysis

Authors
Ana Luisa Fernandes; Purificação Silvano; António Leal; Nuno Guimarães; Rita Rb-Silva; Luís Filipe Cunha; Alípio Jorge;

Publication
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)

Abstract
The development of a robust annotation scheme and corresponding guidelines is crucial for producing annotated datasets that advance both linguistic and computational research. This paper presents a case study that outlines a methodology for designing an annotation scheme and its guidelines, specifically aimed at representing morphosyntactic and semantic information regarding temporal features, as well as medical information in medical reports written in Portuguese. We detail a multi-step process that includes reviewing existing frameworks, conducting an annotation experiment to determine the optimal approach, and designing a model based on these findings. We validated the approach through a pilot experiment where we assessed the reliability and applicability of the annotation scheme and guidelines. In this experiment, two annotators independently annotated a patient's medical report consisting of six documents using the proposed model, while a curator established the ground truth. The analysis of inter-annotator agreement and the annotation results enabled the identification of sources of human variation and provided insights for further refinement of the annotation scheme and guidelines.
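
The abstract above does not say which agreement measure was used. As one illustration of how agreement between two annotators can be quantified, here is a minimal sketch of Cohen's kappa; the choice of metric, the function name, and the example labels are assumptions, not details from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's label
    frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four tokens (EVENT vs. TIME), for illustration only.
annotator_1 = ["EVENT", "EVENT", "TIME", "TIME"]
annotator_2 = ["EVENT", "TIME", "TIME", "TIME"]
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for chance, so two annotators who agree on 75% of items can still score well below 0.75 when the label distribution makes chance agreement likely.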
