2025
Authors
Fernandes, AL; Silvano, P; Guimarães, N; Silva, RR; Munna, TA; Cunha, LF; Leal, A; Campos, R; Jorge, A;
Publication
Proceedings of Text2Story - Eighth Workshop on Narrative Extraction From Texts held in conjunction with the 47th European Conference on Information Retrieval (ECIR 2025), Lucca, Italy, April 10, 2025.
Abstract
Electronic Health Records (EHRs) contain vast amounts of unstructured narrative text, posing challenges for organization, curation, and automated information extraction in clinical and research settings. Developing effective annotation schemes is crucial for training extraction models, yet it remains complex for both human experts and Large Language Models (LLMs). This study compares human- and LLM-generated annotation schemes and guidelines through an experimental framework. In the first phase, both a human expert and an LLM created annotation schemes based on predefined criteria. In the second phase, experienced annotators applied these schemes following the guidelines. In both cases, the results were qualitatively evaluated using Likert scales. The findings indicate that the human-generated scheme is more comprehensive, coherent, and clear compared to those produced by the LLM. These results align with previous research suggesting that while LLMs show promising performance with respect to text annotation, the same does not apply to the development of annotation schemes, and human validation remains essential to ensure accuracy and reliability. © 2025 Copyright for this paper by its authors.
2025
Authors
Munna, TA; Fernandes, AL; Silvano, P; Guimarães, N; Jorge, A;
Publication
Proceedings of Text2Story - Eighth Workshop on Narrative Extraction From Texts held in conjunction with the 47th European Conference on Information Retrieval (ECIR 2025), Lucca, Italy, April 10, 2025.
Abstract
The relationship of a patient with a hospital from admission to discharge is often kept in a series of textual documents that describe the patient’s journey. These documents are important to analyze the different steps of the clinical process and to make aggregated studies of the paths of patients in the hospital. In this paper, we explore the potential of Large Language Models (LLMs) to generate realistic and comprehensive patient journeys in European Portuguese, addressing the scarcity of medical data in this specific context. We employed Google’s Gemini 1.5 Flash model and utilized a dataset of 285 European Portuguese published case reports from the SPMI website, published by the Portuguese Society of Internal Medicine, as references for generating synthetic medical reports. Our methodology involves a sequential approach to generating a synthetic patient journey. Initially, we generate an admission report, followed by a discharge report. Subsequently, we generate a comprehensive patient journey that integrates the admission, multiple daily progress reports, and the discharge into a cohesive narrative. This end-to-end process ensures a realistic and detailed representation of the patient’s clinical pathway as a patient’s journey. The generated reports were rigorously evaluated by medical and linguistic professionals, as well as automatic metrics to measure the inclusion of key medical entities, similarity to the case report, and correct Portuguese variant. Both qualitative and quantitative evaluations confirmed that the generated synthetic reports are predominantly written in European Portuguese without the loss of important medical information from the case reports. This work contributes to developing high-quality synthetic medical data for training LLMs and advancing AI-driven healthcare applications in under-resourced language settings. © 2025 Copyright for this paper by its authors.
2025
Authors
Muratov, A; Shaikh, HF; Jani, V; Mahmoud, T; Xie, Z; Orel, D; Singh, A; Wang, Y; Joshi, A; Iqbal, H; Hee, MS; Sahnan, D; Nikolaidis, N; Silvano, P; Dimitrov, D; Yangarber, R; Campos, R; Jorge, A; Guimarães, N; Sartori, E; Stefanovitch, N; San Martino, GD; Piskorski, J; Nakov, P;
Publication
CoRR
Abstract
2025
Authors
Pacheco, AF; Guimarães, N; Torres, A; Silvano, P; Almeida, I;
Publication
Revista da Associação Portuguesa de Linguística
Abstract
2025
Authors
Leite, M; Silva, RR; Guimarães, N; Stork, L; Jorge, A;
Publication
Progress in Artificial Intelligence - 24th EPIA Conference on Artificial Intelligence, EPIA 2025, Faro, Portugal, October 1-3, 2025, Proceedings, Part I
Abstract
Providing healthcare professionals with quick access to structured standardized information enables comprehensive analysis and improves clinical decision-making. However, an important part of the records in health institutions is in the form of free text. This paper proposes a pipeline that automatically extracts medical information from Electronic Medical Records (EMRs), based on large language models (LLMs) and a domain ontology defined and validated in collaboration with a medical expert. The output is a knowledge graph of clinical narratives that can be used to search through repositories of EMRs or discover new facts. To promote the standardization of the extracted medical terms, we link them to existing international coding systems using biomedical repositories (UMLS - Unified Medical Language System and BioPortal - Biomedical Ontology Repository). We showcase our approach on a set of Portuguese clinical texts of cases of Acute Myeloid Leukemia (AML) guided by one medical expert. We evaluate the quality of the extraction and of the knowledge graph. © 2025 Elsevier B.V., All rights reserved.
2025
Authors
Henriques, L; Guimarães, N; Jorge, A;
Publication
Progress in Artificial Intelligence - 24th EPIA Conference on Artificial Intelligence, EPIA 2025, Faro, Portugal, October 1-3, 2025, Proceedings, Part I
Abstract
The ever-increasing volume of data produced in Healthcare demands solutions capable of automatically extracting the relevant elements of their narratives. However, given privacy regulations, bureaucratic procedures, and annotation efforts, the development of said solutions via Natural Language Processing (NLP) systems becomes hindered due to training data scarcity. Such scarcity increases when we consider languages and language varieties with lower resource availability, such as European and Brazilian Portuguese. To address this problem, we propose a Large Language Model (LLM)-based SDG (Synthetic Data Generation) framework to generate and annotate synthetic clinical texts for medical Named-Entity Recognition (NER). The SDG framework consists of a system/user prompt augmented with real examples, powered by GPT-4o. Our results show that, by feeding the framework few real clinical annotated texts, we can generate synthetic data capable of increasing the performance of NER models with respect to their non-augmented counterparts. In addition, the reduction of the BLEU scores in the generated texts indicates a decrease in the risk of privacy disclosure while ensuring greater lexical diversity. These results highlight the potential of synthetic data as a solution to overcome human annotation bottlenecks and privacy concerns, laying the groundwork for future research in clinical NLP across tasks, domains, and low-resource languages. © 2025 Elsevier B.V., All rights reserved.
The access to the final selection minute is only available to applicants.
Please check the confirmation e-mail of your application to obtain the access code.