Publications - INESC TEC


Publications by LIAAD

2026

MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

Authors
Batista, R; Cunha, LF; Silvano, P; Guimarães, N; Jorge, A; Amorim, E; Campos, R;

Publication
ADVANCES IN INFORMATION RETRIEVAL, ECIR 2026, PT II

Abstract
Municipal meeting minutes are official documents of local governance that exhibit heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easily extracted automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question-answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa with and without a CRF layer) are then applied for fine-grained entity extraction, with delexicalization explored as an additional modeling strategy. We benchmark the pipeline against open-weight and closed-weight LLMs (Phi and Gemini), considering performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, outperforming the evaluated LLMs. Differences observed in cross-municipality evaluation highlight the linguistic diversity and structural variation across municipal records, underscoring the challenges of generalization in this domain and motivating future research in metadata extraction from municipal minutes.


2026

Can LLMs Reliably Label YouTube Videos? A Committee-Based Evaluation

Authors
Mourthé, A; Mello, CE; Jorge, A;

Publication
SOCIAL NETWORKS ANALYSIS AND MINING, ASONAM 2025, PT I

Abstract
As recommender systems play an increasingly central role in shaping information exposure on platforms like YouTube, understanding the nature of the content they promote, especially in sensitive contexts, requires scalable and reliable labelling methods. This paper investigates the use of Large Language Models (LLMs) to label YouTube videos based solely on their metadata. We propose a committee-based approach that aggregates predictions from an ensemble of seven state-of-the-art LLMs through majority voting. Using a novel dataset collected via simulated user interactions on YouTube, we analyse model agreement, labelling behaviour, and the influence of model size. To assess label reliability, we also investigate the semantic coherence of label assignments. Our results show that LLM committees produce highly consistent labels in low-disagreement settings. These findings highlight both the promise and limitations of LLM-based annotation for auditing social networks.


2026

The 9th International Workshop on Narrative Extraction from Text: Text2Story 2026

Authors
Campos, R; Jorge, A; Jatowt, A; Bhatia, S; Litvak, M;

Publication
ADVANCES IN INFORMATION RETRIEVAL, ECIR 2026, PT III

Abstract
For eight years, the Text2Story Workshop series has fostered a vibrant research community dedicated to narrative understanding, advancing shared insights into the challenges of modelling narrative structure in text. While earlier approaches laid important foundations, recent progress in Transformers and Large Language Models (LLMs) has fundamentally reshaped the field. Building on the increasing prominence of LLM-based contributions in recent editions, the ninth edition of Text2Story expands the focus toward agentic AI, where systems plan, reason, and interact over time using narratives as internal representations. Recent advances, including long-context architectures, instruction- and preference-tuned models, retrieval-augmented generation, and discourse-aware prompting, have broadened the applicability of LLMs to complex narrative tasks. Nevertheless, reliably capturing fine-grained narrative structures remains challenging, particularly for event chains, temporal and causal relations, character development, and perspective consistency. These challenges are amplified in interactive and agentic settings, where narrative coherence, controllability, and reliability are critical. This edition of Text2Story explores both the opportunities and limitations of LLMs and agentic systems for narrative understanding, including the analysis of narratives generated by LLMs themselves with respect to consistency, hallucination, bias, and control. Through a diverse program of research papers, works in progress, demos, resources, and keynote talks, the workshop continues to advance narrative understanding in the era of foundation and agentic models.


2026

NLP for Local Governance Meeting Records: A Focus Article on Tasks, Datasets, Metrics and Benchmark

Authors
Campos, R; Evans, JP; Isidro, J; Marques, M; Cunha, LF; Jorge, A; Nunes, S; Guimarães, N;

Publication
CoRR

Abstract

2026

EPHG-CR: embedding propagation for heterogeneous graphs with class refinement

Authors
Dos Santos, BN; Marcacini, RM; Jorge, AM; Campos, R; Rezende, SO;

Publication
APPLIED INTELLIGENCE

Abstract
Heterogeneous graphs can represent real-world problems in a way close to reality, supporting diverse types of vertices and edges. However, their inherent heterogeneity poses challenges in interpreting problem semantics. To address this, heterogeneous graph embedding, aiming to map graph elements to low-dimensional vectors, simplifies subsequent machine learning analysis. This approach has gained prominence in machine learning, fueling classification, recommendation, and similarity search applications. Embedding diverse data is essential for efficient data processing. Incorporating language models, like BERT, into heterogeneous graphs enhances semantic context capture, which is particularly useful when one vertex type represents text. Language models stand out in contextual representation, enriching graph vertex embeddings for various tasks. This paper proposes a novel approach to enhancing heterogeneous graph embeddings by combining language models and task class data. Our approach increases vector quality, accounting for graph structure, semantic textual information, and task labels. We compared our proposal with a language model in the aspect-based sentiment analysis task, demonstrating competitive results and, in some cases, a slight superiority. Furthermore, we explore applications of embeddings from auxiliary vertices in another task, highlighting another advantage of the approach over the language model.


2026

Preface

Authors
Ribeiro, P; Japkowicz, N; Jorge, AM; Soares, C; Abreu, PH; Pfahringer, B; Gama, MP; Larrañaga, P; Dutra, I; Pechenizkiy, M; Pashami, S; Cortez, P;

Publication
Lecture Notes in Computer Science

Abstract
[No abstract available]
