Publications

Publications by Alípio Jorge

2025

Using LLMs to Generate Patient Journeys in Portuguese: an Experiment

Authors
Munna, TA; Fernandes, AL; Silvano, P; Guimarães, N; Jorge, A;

Publication
Proceedings of Text2Story - Eighth Workshop on Narrative Extraction From Texts held in conjunction with the 47th European Conference on Information Retrieval (ECIR 2025), Lucca, Italy, April 10, 2025.

Abstract
The relationship of a patient with a hospital from admission to discharge is often kept in a series of textual documents that describe the patient’s journey. These documents are important to analyze the different steps of the clinical process and to make aggregated studies of the paths of patients in the hospital. In this paper, we explore the potential of Large Language Models (LLMs) to generate realistic and comprehensive patient journeys in European Portuguese, addressing the scarcity of medical data in this specific context. We employed Google’s Gemini 1.5 Flash model and used a dataset of 285 case reports in European Portuguese, published on the website of the Portuguese Society of Internal Medicine (SPMI), as references for generating synthetic medical reports. Our methodology involves a sequential approach to generating a synthetic patient journey. Initially, we generate an admission report, followed by a discharge report. Subsequently, we generate a comprehensive patient journey that integrates the admission, multiple daily progress reports, and the discharge into a cohesive narrative. This end-to-end process ensures a realistic and detailed representation of the patient’s clinical pathway. The generated reports were rigorously evaluated by medical and linguistic professionals, as well as with automatic metrics that measure the inclusion of key medical entities, similarity to the case report, and use of the correct Portuguese variant. Both qualitative and quantitative evaluations confirmed that the generated synthetic reports are predominantly written in European Portuguese without the loss of important medical information from the case reports. This work contributes to developing high-quality synthetic medical data for training LLMs and advancing AI-driven healthcare applications in under-resourced language settings. © 2025 Copyright for this paper by its authors.

2025

Leveraging Synthetic Data to Develop a Machine Learning Model for Voiding Flow Rate Prediction From Audio Signals

Authors
Alvarez, ML; Bahillo, A; Arjona, L; Nogueira, DM; Gomes, EF; Jorge, AM;

Publication
IEEE Access

Abstract
Sound-based uroflowmetry (SU) is a non-invasive technique emerging as an alternative to traditional uroflowmetry (UF) to calculate the voiding flow rate based on the sound generated by the urine impacting the water in a toilet, enabling remote monitoring and reducing the patient burden and clinical costs. This study trains four different machine learning (ML) models (random forest, gradient boosting, support vector machine and convolutional neural network) using both regression and classification approaches to predict and categorize the voiding flow rate from sound events. The models were trained with a dataset that contains sounds from synthetic void events generated with a high-precision peristaltic pump and a traditional toilet. Sound was simultaneously recorded with three devices: Ultramic384k, Mi A1 smartphone and Oppo Smartwatch. For audio feature extraction, our analysis showed that segmenting the audio signals into 1000 ms segments with frequencies up to 16 kHz provided the best results. Results show that random forest achieved the best performance in both regression and classification tasks, with a mean absolute error (MAE) of 0.9, 0.7 and 0.9 ml/s and quadratic weighted kappa (QWK) of 0.99, 1.0 and 1.0 for the three devices. To evaluate the models in a real environment and assess the effectiveness of training with synthetic data, the best-performing models were retrained and validated using a real voiding sounds dataset. The results reported an MAE below 2.5 ml/s and a QWK above 0.86 for regression and classification tasks, respectively.
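
For readers unfamiliar with the two metrics named in this abstract, the sketch below shows how MAE and QWK are conventionally computed. It is an illustration only, not the authors' evaluation code: the function names, the integer encoding of flow-rate categories (0 to n_classes - 1), and the sample values in the usage note are assumptions made here.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values (e.g. ml/s)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def quadratic_weighted_kappa(true_labels, pred_labels, n_classes):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_classes-1.

    Assumption: categories are encoded as consecutive integers; the paper's
    actual category scheme is not given in the abstract.
    """
    if len(true_labels) != len(pred_labels) or len(true_labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")

    n = len(true_labels)
    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, pred_labels):
        observed[t][p] += 1

    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreement penalty grows with the squared distance between classes.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independent marginals.
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n

    if weighted_expected == 0.0:
        # All labels fall in a single class on both sides: treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

As a usage example with made-up values, mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]) averages the errors 0.5, 0.0 and 1.0, and quadratic_weighted_kappa returns 1.0 whenever the predicted categories match the true ones exactly.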

2025

Leveraging LLMs to Improve Human Annotation Efficiency with INCEpTION

Authors
Cunha, LF; Yu, N; Silvano, P; Campos, R; Jorge, A;

Publication
Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part V

Abstract
Manual text annotation is a complex and time-consuming task. However, recent advancements demonstrate that such a task can be accelerated with automated pre-annotation. In this paper, we present a methodology to improve the efficiency of manual text annotation by leveraging LLMs for text pre-annotation. For this purpose, we train a BERT model for a token classification task and integrate it into the INCEpTION annotation tool to generate span-level suggestions for human annotators. To assess the usefulness of our approach, we conducted an experiment where an experienced linguist annotated plain text both with and without our model’s pre-annotations. Our results show that the model-assisted approach reduces annotation time by nearly 23%. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

2025

Preface

Authors
Campos, R; Jorge, M; Jatowt, A; Bhatia, S; Litvak, M;

Publication
CEUR Workshop Proceedings

Abstract
[No abstract available]

2025

The 8th International Workshop on Narrative Extraction from Texts: Text2Story 2025

Authors
Campos, R; Jorge, A; Jatowt, A; Bhatia, S; Litvak, M;

Publication
Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part V

Abstract
For seven years, the Text2Story Workshop series has fostered a vibrant community dedicated to understanding narrative structure in text, resulting in significant contributions to the field and developing a shared understanding of the challenges in this domain. While traditional methods have yielded valuable insights, the advent of Transformers and LLMs has ignited a new wave of interest in narrative understanding. The previous iteration of the workshop also witnessed a surge in LLM-based approaches, demonstrating the community’s growing recognition of their potential. In this eighth edition, we propose to go deeper into the role of LLMs in narrative understanding. While LLMs have revolutionized the field of NLP and are the go-to tools for any NLP task, the ability to capture, represent and analyze contextual nuances in longer texts is still an elusive goal, let alone the understanding of consistent fine-grained narrative structures in text. Consequently, this iteration of the workshop will explore the issues involved in using LLMs to unravel narrative structures, while also examining the characteristics of narratives generated by LLMs. By fostering dialogue on these emerging areas, we aim to continue the workshop’s tradition of driving innovation in narrative understanding research. Text2Story encompasses sessions covering full research papers, work-in-progress, demos, resources, position and dissemination papers, along with one keynote talk. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

2025

Enhancing Portuguese Variety Identification with Cross-Domain Approaches

Authors
Sousa, H; Almeida, R; Silvano, P; Cantante, I; Campos, R; Jorge, A;

Publication
Thirty-Ninth AAAI Conference on Artificial Intelligence, AAAI-25, Vol. 39 No. 24

Abstract
Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and studied the effectiveness of transformer-based LVI classifiers for cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open source the code, corpus, and models to foster further research in this task.
