2025
Authors
Ermakova, L; Bosser, AG; Miller, T; Campos, R
Publication
Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part V
Abstract
Over the last three years, the JOKER Lab series at CLEF has gathered an active community of researchers in natural language processing and information retrieval to collaborate on non-literal use of language in text. Such language can be a challenge for AI systems, but also sometimes for humans, as it requires understanding implicit cultural references and unorthodox interactions between form and meaning. In this paper, we discuss the lessons learned from the previous iterations of the Lab and describe how its upcoming edition will build upon those to address new challenges. In 2025, JOKER will provide novel tasks and update some previous ones with new data and new languages. This year we provide sandbox environments for experimenting with humour-aware information retrieval (Task 1), a previously featured task now enhanced with an all-new Portuguese corpus; wordplay translation in text (Task 2), another historical task for which we provide new corpora; onomastic wordplay (Task 3), a new task focussed on humorous proper names in fiction; and controlled creativity (Task 4), another novel task that aims at identifying and avoiding hallucinations. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
2025
Authors
Silva, R; Campos, R
Publication
Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part V
Abstract
Around 80% of websites change significantly or disappear altogether after the first year, resulting in the loss of invaluable information. In this volatile scenario, preserving online content is increasingly essential. This is especially critical for local news outlets, which produce a wealth of information within the unique context of their communities but often lack sufficient archiving resources. In this paper, we take a significant step forward by leveraging the information preserved by the Portuguese Web Archive, Arquivo.pt, to recreate the website of a local news outlet. This online demo grants users direct access to previously lost news articles, images, and front covers, thus contributing to preserving local digital heritage. An IR system was also implemented to ensure easy access, along with a recommendation system based on BERT embeddings to suggest related news articles and enhance user engagement. As a final contribution, we also provide a Python package, enabling others to replicate the process of collecting, processing, retrieving, and recreating websites for local news outlets in Portugal. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
2025
Authors
Muratov, A; Shaikh, HF; Jani, V; Mahmoud, T; Xie, Z; Orel, D; Singh, A; Wang, Y; Joshi, A; Iqbal, H; Hee, MS; Sahnan, D; Nikolaidis, N; Silvano, P; Dimitrov, D; Yangarber, R; Campos, R; Jorge, A; Guimarães, N; Sartori, E; Stefanovitch, N; San Martino, GD; Piskorski, J; Nakov, P
Publication
CoRR
Abstract
2025
Authors
Fernandes, AL; Silvano, P; Guimarães, N; Silva, RR; Munna, TA; Cunha, LF; Leal, A; Campos, R; Jorge, A
Publication
Proceedings of Text2Story - Eighth Workshop on Narrative Extraction From Texts held in conjunction with the 47th European Conference on Information Retrieval (ECIR 2025), Lucca, Italy, April 10, 2025.
Abstract
Electronic Health Records (EHRs) contain vast amounts of unstructured narrative text, posing challenges for organization, curation, and automated information extraction in clinical and research settings. Developing effective annotation schemes is crucial for training extraction models, yet it remains complex for both human experts and Large Language Models (LLMs). This study compares human- and LLM-generated annotation schemes and guidelines through an experimental framework. In the first phase, both a human expert and an LLM created annotation schemes based on predefined criteria. In the second phase, experienced annotators applied these schemes following the guidelines. In both cases, the results were qualitatively evaluated using Likert scales. The findings indicate that the human-generated scheme is more comprehensive, coherent, and clear compared to those produced by the LLM. These results align with previous research suggesting that while LLMs show promising performance with respect to text annotation, the same does not apply to the development of annotation schemes, and human validation remains essential to ensure accuracy and reliability. © 2025 Copyright for this paper by its authors.
2025
Authors
Cunha, LF; Yu, N; Silvano, P; Campos, R; Jorge, A
Publication
Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part V
Abstract
Manual text annotation is a complex and time-consuming task. However, recent advancements demonstrate that such a task can be accelerated with automated pre-annotation. In this paper, we present a methodology to improve the efficiency of manual text annotation by leveraging LLMs for text pre-annotation. For this purpose, we train a BERT model for a token classification task and integrate it into the INCEpTION annotation tool to generate span-level suggestions for human annotators. To assess the usefulness of our approach, we conducted an experiment where an experienced linguist annotated plain text both with and without our model’s pre-annotations. Our results show that the model-assisted approach reduces annotation time by nearly 23%. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
2025
Authors
Campos, R; Jorge, M; Jatowt, A; Bhatia, S; Litvak, M
Publication
CEUR Workshop Proceedings
Abstract
[No abstract available]