Publications - INESC TEC

Publications by Alípio Jorge

2026

ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles

Authors
Campos, R; Sequeira, R; Nerea, S; Cantante, I; Folques, D; Cunha, LF; Canavilhas, J; Branco, A; Jorge, A; Nunes, S; Guimaraes, N; Silvano, P

Publication
Advances in Information Retrieval, ECIR 2026, Part IV

Abstract
Fact-checking remains a demanding and time-consuming task, still largely dependent on manual verification and unable to match the rapid spread of misinformation online. This is particularly important because debunking false information typically takes longer to reach consumers than the misinformation itself; accelerating corrections through automation can therefore help counter it more effectively. Although many organizations perform manual fact-checking, this approach is difficult to scale given the growing volume of digital content. These limitations have motivated interest in automating fact-checking, where identifying claims is a crucial first step. However, progress has been uneven across languages, with English dominating due to abundant annotated data. Portuguese, like other languages, still lacks accessible, licensed datasets, limiting research, Natural Language Processing (NLP) developments, and applications. In this paper, we introduce ClaimPT, a dataset of European Portuguese news articles annotated for factual claims, comprising 1,308 articles and 6,875 individual annotations. Unlike most existing resources based on social media or parliamentary transcripts, ClaimPT focuses on journalistic content, collected through a partnership with LUSA, the Portuguese News Agency. To ensure annotation quality, two trained annotators labeled each article, with a curator validating all annotations according to a newly proposed scheme. We also provide baseline models for claim detection, establishing initial benchmarks and enabling future NLP and Information Retrieval (IR) applications. By releasing ClaimPT, we aim to advance research on low-resource fact-checking and enhance understanding of misinformation in news media.

2026

CitiLink: Enhancing Municipal Transparency and Citizen Engagement Through Searchable Meeting Minutes

Authors
Silva, R; Evans, J; Isidro, J; Marques, M; Fonseca, A; Morais, R; Canavilhas, J; Pasquali, A; Silvano, P; Jorge, A; Guimaraes, N; Nunes, S; Campos, R

Publication
Advances in Information Retrieval, ECIR 2026, Part IV

Abstract
City council minutes are typically lengthy and formal documents with a bureaucratic writing style. Although publicly available, their structure often makes it difficult for citizens or journalists to efficiently find information. In this demo, we present CitiLink, a platform designed to transform unstructured municipal meeting minutes into structured and searchable data, demonstrating how NLP and IR can enhance the accessibility and transparency of local government. The system employs LLMs to extract metadata, discussed subjects, and voting outcomes, which are then indexed in a database to support full-text search with BM25 ranking and faceted filtering through a user-friendly interface. The system was built over a collection of 120 meeting minutes made available by six Portuguese municipalities. To assess its usability, CitiLink was tested through guided sessions with municipal personnel, providing insights into how real users interact with the system. In addition, we evaluated Gemini's performance in extracting relevant information from the minutes.

2026

VotIE: Information Extraction from Meeting Minutes

Authors
Evans, JP; Cunha, LF; Silvano, P; Jorge, A; Guimarães, N; Nunes, S; Campos, R

Publication
CoRR

Abstract

2026

SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation

Authors
Isidro, J; Cunha, LF; Silvano, P; Jorge, A; Guimarães, N; Nunes, S; Campos, R

Publication
CoRR

Abstract

2026

MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

Authors
Batista, R; Cunha, LF; Silvano, P; Guimaraes, N; Jorge, A; Amorim, E; Campos, R

Publication
Advances in Information Retrieval, ECIR 2026, Part II

Abstract
Municipal meeting minutes are official documents of local governance that exhibit heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easily extracted automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question-answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa, with and without a CRF layer) are then applied for fine-grained entity extraction, with delexicalization explored as an additional modeling strategy. We benchmark the pipeline against open- and closed-weight LLMs (Phi and Gemini), considering performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, outperforming the evaluated LLMs. Differences observed in cross-municipality evaluation highlight the linguistic diversity and structural variation across municipal records, underscoring the challenges of generalization in this domain and motivating future research in metadata extraction from municipal minutes.

2025

NarratEX Dataset: Explaining the Dominant Narratives in News Texts

Authors
Guimarães, N; Silvano, P; Campos, R; Jorge, AM; Pacheco, AF; Dimitrov, DI; Nikolaidis, N; Yangarber, R; Sartori, E; Stefanovitch, N; Nakov, P; Piskorski, J; San Martino, GD

Publication
EMNLP (Findings)

Abstract
We present NarratEX, a dataset designed for the task of explaining the choice of the Dominant Narrative in a news article, and intended to support the research community in addressing challenges such as discourse polarization and propaganda detection. Our dataset comprises 1,056 news articles in four languages, Bulgarian, English, Portuguese, and Russian, covering two globally significant topics: the Ukraine-Russia War (URW) and Climate Change (CC). Each article is manually annotated with a dominant narrative and sub-narrative labels, and an explanation justifying the chosen labels. We describe the dataset, the process of its creation, and its characteristics. We present experiments with two new proposed tasks: Explaining Dominant Narrative based on Text, which involves writing a concise paragraph to justify the choice of the dominant narrative and sub-narrative of a given text, and Inferring Dominant Narrative from Explanation, which involves predicting the appropriate dominant narrative category based on an explanatory text. The proposed dataset is a valuable resource for advancing research on detecting and mitigating manipulative content, while promoting a deeper understanding of how narratives influence public discourse.
