
Publications by Luís Filipe Cunha

2025

Human Experts vs. Large Language Models: Evaluating Annotation Scheme and Guidelines Development for Clinical Narratives

Authors
Fernandes, AL; Silvano, P; Guimarães, N; Silva, RR; Munna, TA; Cunha, LF; Leal, A; Campos, R; Jorge, A;

Publication
Proceedings of Text2Story - Eighth Workshop on Narrative Extraction From Texts held in conjunction with the 47th European Conference on Information Retrieval (ECIR 2025), Lucca, Italy, April 10, 2025.

Abstract
Electronic Health Records (EHRs) contain vast amounts of unstructured narrative text, posing challenges for organization, curation, and automated information extraction in clinical and research settings. Developing effective annotation schemes is crucial for training extraction models, yet it remains complex for both human experts and Large Language Models (LLMs). This study compares human- and LLM-generated annotation schemes and guidelines through an experimental framework. In the first phase, both a human expert and an LLM created annotation schemes based on predefined criteria. In the second phase, experienced annotators applied these schemes following the guidelines. In both cases, the results were qualitatively evaluated using Likert scales. The findings indicate that the human-generated scheme is more comprehensive, coherent, and clear than the one produced by the LLM. These results align with previous research suggesting that while LLMs show promising performance on text annotation, the same does not apply to the development of annotation schemes, and human validation remains essential to ensure accuracy and reliability. © 2025 Copyright for this paper by its authors.

2022

NER in Archival Finding Aids: Extended

Authors
Cunha, LFD; Ramalho, JC;

Publication
MACHINE LEARNING AND KNOWLEDGE EXTRACTION

Abstract
The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country's history. Currently, most Portuguese archives have made their finding aids available to the public in digital format; however, these data lack any annotation, so it is not always easy to analyze their content. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities from the archival finding aids. These named entities translate into crucial information about their context and, with high-confidence results, they can be used for several purposes, for example, the creation of smart browsing tools using entity linking and record linking techniques. In order to achieve high result scores, we annotated several corpora to train our own Machine Learning algorithms in this domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public through a developed web platform, NER@DI.

2022

NER in Archival Finding Aids: Next Level

Authors
Cunha, LFD; Ramalho, JC;

Publication
INFORMATION SYSTEMS AND TECHNOLOGIES, WORLDCIST 2022, VOL 2

Abstract
Currently, there is a vast amount of archival finding aids in Portuguese archives; however, these documents lack structure (they are not annotated), making them hard to process and work with. We therefore intend to extract and classify entities of interest, such as geographical locations, people's names, and dates. For this, we will use an architecture that has been revolutionizing several NLP tasks, Transformers, presenting several models in order to achieve high results. We also intend to understand the degree of improvement that this new mechanism presents in comparison with previous architectures. Can Transformer-based models replace LSTMs in NER? We intend to answer this question throughout this paper.

2021

NER in Archival Finding Aids

Authors
Costa Cunha, LF; Ramalho, JC;

Publication
10th Symposium on Languages, Applications and Technologies, SLATE 2021, July 1-2, 2021, Vila do Conde/Póvoa de Varzim, Portugal.

Abstract
At the moment, the vast majority of Portuguese archives with an online presence use a software solution to manage their finding aids, e.g. Digitarq or Archeevo. Most of these finding aids are written in natural language without any annotation that would enable a machine to identify named entities, geographical locations, or even dates. Such annotation would allow the machine to create smart browsing tools on top of those record contents, such as entity linking and record linking. In this work we have created a set of datasets to train Machine Learning algorithms to find those named entities and geographical locations. After training several algorithms, we tested them on several datasets and registered their precision and accuracy. These results enabled us to draw some conclusions about what kind of precision we can achieve with this approach in this context and what to do with the results: do we have enough precision and accuracy to create toponymic and anthroponymic indexes for archival finding aids? Is this approach suitable in this context? These are some of the questions we intend to answer throughout this paper.

2022

Fine-Tuning BERT Models to Extract Named Entities from Archival Finding Aids

Authors
Costa Cunha, LF; Ramalho, JC;

Publication
Proceedings of the 26th International Conference on Theory and Practice of Digital Libraries - Workshops and Doctoral Consortium, Padua, Italy, September 20, 2022.

Abstract
In recent works, several NER models were developed to extract named entities from Portuguese Archival Finding Aids. In this paper, we complement the work already done by presenting a different NER model with a new architecture, Bidirectional Encoder Representations from Transformers (BERT). In order to do so, we used a BERT model that was pre-trained on Portuguese vocabulary and fine-tuned it to our concrete classification problem, NER. In the end, we compared the results obtained with those of previous architectures. In addition to this model, we also developed an annotation tool that uses ML models to speed up the corpora annotation process. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

2022

Reasoning with Portuguese Word Embeddings

Authors
Costa Cunha, LF; Almeida, JJ; Simões, A;

Publication
11th Symposium on Languages, Applications and Technologies, SLATE 2022, July 14-15, 2022, Universidade da Beira Interior, Covilhã, Portugal.

Abstract
Representing words with semantic distributions to create ML models is a widely used technique for performing Natural Language Processing tasks. In this paper, we trained word embedding models with different types of Portuguese corpora, analyzing the influence of the models' parameterization, the corpora size, and domain. We then validated each model with the classical evaluation methods available: four-word analogies and measurement of the similarity of pairs of words. In addition to these methods, we proposed new alternative techniques to validate word embedding models, presenting new resources for this purpose. Finally, we discussed the obtained results and argued about some limitations of the word embedding models' evaluation methods. © Luís Filipe Cunha, J. João Almeida, and Alberto Simões.
