Publications by CRACS

2026

Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track and Demo Track - European Conference, ECML PKDD 2025, Porto, Portugal, September 15-19, 2025, Proceedings, Part X

Authors
Dutra, I; Pechenizkiy, M; Cortez, P; Pashami, S; Pasquali, A; Moniz, N; Jorge, AM; Soares, C; Abreu, PH; Gama, J;

Publication
ECML/PKDD (10)

2026

Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track

Authors
Dutra, I; Pechenizkiy, M; Cortez, P; Pashami, S; Jorge, AM; Soares, C; Abreu, PH; Gama, J;

Publication
Lecture Notes in Computer Science

2026

Enhancing Cellular Line Representation with Transformer-Based Text Embeddings for Precision Drug Repositioning

Authors
Carrera, I; Criollo, J; Dutra, I;

Publication
SMART TECHNOLOGIES, SYSTEMS AND APPLICATIONS, SMARTTECH-IC 2024, PT I

Abstract
This paper presents a novel approach to the computational representation of cellular lines using transformer-based embeddings. By leveraging state-of-the-art natural language processing techniques, we generate context-aware embeddings from biomedical literature in the PubMed database, offering a more nuanced and biologically relevant representation of cellular lines than traditional methods like TF-IDF and SVDD. We applied these embeddings to cluster cellular lines, using the elbow method to identify a set of distinct clusters that reflect biologically meaningful relationships. To evaluate the quality of these clusters, we employed the Topic Coherence metric, achieving a coherence score of 0.395, indicative of moderate consistency across clusters. The results demonstrate the potential of transformer-based models to improve drug discovery by identifying shared characteristics between cellular lines, enabling more accurate drug response predictions and advancing personalized medicine. This method improves the precision of cellular line modeling, paving the way for more efficient drug repositioning and targeted therapies in cancer research.
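
The elbow method mentioned in this abstract can be illustrated with a minimal sketch. It assumes a clustering algorithm has already been run for k = 1, 2, ... and that its within-cluster sum of squares per k is available. The function name `elbow_k` and the toy values are hypothetical and are not taken from the paper, which does not state how the bend was located.

```python
def elbow_k(inertias):
    """Pick the number of clusters at the sharpest bend of the inertia curve.

    inertias[i] is the within-cluster sum of squares for k = i + 1.
    The bend is measured by the discrete second difference (an assumption;
    the paper does not specify how the elbow was identified).
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


# Toy, made-up inertia values for k = 1..6, used only to exercise the function.
toy_inertias = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]
chosen_k = elbow_k(toy_inertias)
```

With these toy values the curve flattens after the third point, so `chosen_k` is 3.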

2026

Evaluating Transfer Learning Methods on Real-World Data Streams: A Case Study in Financial Fraud Detection

Authors
Pereira, RR; Bono, J; Ferreira, H; Ribeiro, P; Soares, C; Bizarro, P;

Publication
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES. APPLIED DATA SCIENCE TRACK, ECML PKDD 2025, PT IX

Abstract
When the available data for a target domain is limited, transfer learning (TL) methods leverage related data-rich source domains to train and evaluate models, before deploying them on the target domain. However, most TL methods assume fixed levels of labeled and unlabeled target data, which contrasts with real-world scenarios where both data and labels arrive progressively over time. As a result, evaluations based on these static assumptions may not reflect how methods perform in practice. To support a more realistic assessment of TL methods in dynamic settings, we propose an evaluation framework that (1) simulates varying data availability over time, (2) creates multiple domains via resampling of a given dataset and (3) introduces inter-domain variability through controlled transformations, e.g., including time-dependent covariate and concept shifts. These capabilities enable the systematic simulation of a large number of variants of the experiments, providing deeper insights into how algorithms may behave when deployed. We demonstrate the usefulness of the proposed framework by performing a case study on a proprietary real-world suite of card payment datasets. To support reproducibility, we also apply the framework on the publicly available Bank Account Fraud (BAF) dataset. By providing a methodology for evaluating TL methods over time and in different data availability conditions, our framework supports a better understanding of model behavior in real-world environments, which enables more informed decisions when deploying models in new domains.
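
The three capabilities listed in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' framework: records are hypothetical `(feature, label, time)` tuples, the covariate shift is a simple linear offset, and label availability is modeled with a fixed `label_delay`, none of which the abstract specifies.

```python
import random


def make_domains(records, n_domains, size, seed=0):
    """Create several domains by resampling one dataset with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(records) for _ in range(size)] for _ in range(n_domains)]


def apply_covariate_shift(domain, offset_per_step):
    """Add a time-dependent offset to the feature of each (x, y, t) record."""
    return [(x + offset_per_step * t, y, t) for (x, y, t) in domain]


def availability_over_time(domain, steps, label_delay):
    """Yield, per time step, the records seen so far and those already labeled.

    A record arrives at its time t; its label arrives label_delay steps later
    (a hypothetical fixed delay, chosen only for illustration).
    """
    for now in range(steps):
        seen = [r for r in domain if r[2] <= now]
        labeled = [r for r in seen if r[2] + label_delay <= now]
        yield now, seen, labeled


# Toy records: feature i, alternating label, arrival time i.
records = [(float(i), i % 2, i) for i in range(5)]
domains = make_domains(records, n_domains=3, size=4)
shifted = apply_covariate_shift(records, offset_per_step=0.5)
snapshots = list(availability_over_time(records, steps=5, label_delay=2))
```

Each snapshot can then be handed to a transfer learning method, so the same method is evaluated repeatedly as unlabeled and labeled target data accumulate.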

2026

Optimizing Medical Image Captioning with Conditional Prompt Encoding

Authors
Fernandes, RF; Oliveira, HS; Ribeiro, PP; Oliveira, HP;

Publication
PATTERN RECOGNITION AND IMAGE ANALYSIS, IBPRIA 2025, PT II

Abstract
Medical image captioning is an essential tool for producing descriptive text reports of medical images. A central problem of medical image captioning is poor domain-specific description generation, because large pre-trained language models are trained primarily on non-medical text, whose semantics differ from those of medical text. To overcome this limitation, we explore improvements in contrastive learning for X-ray images, complemented with soft prompt engineering for medical image captioning and conditional text decoding for caption generation. The main objective is to develop a soft-prompt model that improves the accuracy and clinical relevance of the automatically generated captions while guaranteeing their linguistic accuracy without degrading the models' performance. Experiments on the MIMIC-CXR and ROCO datasets showed that including tailored soft prompts improved accuracy and efficiency while ensuring a more cohesive medical context for captions, aiding medical diagnosis and encouraging more accurate reporting.

2026

Large Language Model Framework for Log Sequence Anomaly Detection

Authors
Reis, J; Areias, M; Barbosa, JG;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2025, PT I

Abstract
Log analysis is fundamental to modern software observability systems, playing a key role in improving system reliability. Recently, there has been a growing adoption of Large Language Models (LLMs) for log anomaly detection, due to their ability to learn complex patterns. In this work, we propose a model-agnostic framework that allows seamless plug-and-play integration of different LLMs, making it easy to experiment with and select the model that fits specific needs. These models are first fine-tuned on normal log data, learning their patterns. During inference, the model predicts the most probable next tokens based on the preceding context in each sequence. Anomaly detection is performed using Top-K predictions, where sequences are flagged as anomalous if the actual log entry does not appear among the K most probable next tokens, with K determined using the validation dataset. The proposed framework is evaluated on three widely used benchmark datasets (HDFS, BGL, and Thunderbird), where it consistently achieves competitive results, outperforming state-of-the-art methods in multiple scenarios. These results highlight the effectiveness of LLM-based log analysis and the importance of flexibility when selecting models for specific operational contexts.
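
The Top-K decision rule described in this abstract can be sketched as follows. To keep the example self-contained, a simple bigram frequency table stands in for the fine-tuned LLM, so it conditions only on the previous log entry rather than the full preceding context. That substitution, the function names, and the toy log keys are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_bigram_model(normal_sequences):
    """Count which log key follows which in normal data (stand-in for an LLM)."""
    counts = defaultdict(Counter)
    for seq in normal_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def top_k_next(model, prev, k):
    """Return the k most frequent successors of `prev` under the stand-in model."""
    return [key for key, _ in model[prev].most_common(k)]


def is_anomalous(model, seq, k):
    """Flag a sequence if any actual entry is missing from the top-k predictions."""
    return any(
        nxt not in top_k_next(model, prev, k) for prev, nxt in zip(seq, seq[1:])
    )


# Toy "normal" log sequences (hypothetical log keys).
normal = [
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["open", "read", "close"],
]
model = fit_bigram_model(normal)

flag_common = is_anomalous(model, ["open", "read", "close"], k=1)
flag_rare_k1 = is_anomalous(model, ["open", "write", "close"], k=1)
flag_rare_k2 = is_anomalous(model, ["open", "write", "close"], k=2)
flag_unseen = is_anomalous(model, ["open", "close"], k=2)
```

The rarer but normal `write` path is flagged at K = 1 and accepted at K = 2, which shows why K has to be tuned on validation data: too small a K raises false alarms, too large a K lets anomalies through.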
