2025
Authors
Sousa, H; Almasian, S; Campos, R; Jorge, A;
Publication
THIRTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, AAAI-25, VOL 39 NO 24
Abstract
Language models have become foundational to many widely used systems. However, these seemingly advantageous models are double-edged swords. While they excel in tasks related to resource-rich languages like English, they often lose the fine nuances of language forms, dialects, and varieties that are inherent to languages spoken in multiple regions of the world. Language varieties like European Portuguese are neglected in favor of their more popular counterpart, Brazilian Portuguese, leading to suboptimal performance in various linguistic tasks. To address this gap, we introduce the first open-source translation model specifically tailored for European Portuguese, along with a novel dataset designed for this task. Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties.
2025
Authors
Chandramohan, MS; da Silva, IM; Ribeiro, RP; Jorge, A; da Silva, JE;
Publication
ENVIRONMENTS
Abstract
This study investigates the spatial distribution and chemical elemental composition of soils in Rome (Italy) using X-ray fluorescence analysis. Fifty-nine soil samples were collected from various locations within the urban areas of the Rome municipality and were analyzed for 19 elements. Multivariate statistical techniques, including nonlinear mapping, principal component analysis, and hierarchical cluster analysis, were employed to identify clusters of similar soil samples, map their spatial distribution, and derive environmental quality information. The soil sample clusters reflect the influence of natural geological processes and anthropogenic activities on soil contamination patterns. Spatial clustering using the k-means algorithm further identified six distinct clusters, each with specific geographical distributions and elemental characteristics. The findings underscore the importance of targeted soil assessments to ensure the sustainable use of land resources in urban areas.
2025
Authors
Cunha, LF; Guimarães, N; Mendes, A; Campos, R; Jorge, A;
Publication
ECIR (5)
Abstract
In healthcare, diagnoses usually rely on physician expertise. However, complex cases may benefit from consulting similar past clinical case reports. In this paper, we present MedLink (http://medlink.inesctec.pt), a tool that, given a free-text medical report, retrieves and ranks relevant clinical case reports published in health conferences and journals, aiming to support clinical decision-making, particularly in challenging or complex diagnoses. To this end, we trained two BERT models on the sentence similarity task: a bi-encoder for retrieval and a cross-encoder for reranking. To evaluate our approach, we used 10 medical reports and asked a physician to rank the top 10 most relevant published case reports for each one. Our results show that MedLink’s ranking model achieved an NDCG@10 of 0.747. Our demo also includes the visualization of clinical entities (using a NER model) and the production of a textual explanation (using an LLM) to make it easier to compare and contrast reports.
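The NDCG@10 figure quoted in this abstract can be made concrete with a small sketch. The abstract does not say which gain formulation was used, so the linear-gain variant below, and the names `dcg_at_k` and `ndcg_at_k`, are assumptions for illustration only, not the authors' evaluation code.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query.

    `relevances` holds the graded relevance of each returned item, in the
    order the system ranked them. The ideal ordering is taken from the same
    list (an assumption: all judged items appear in the list).
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg

# Hypothetical relevance grades for one query, in system order.
system_order = [3, 1, 2, 0, 0]
score = ndcg_at_k(system_order, k=10)
```

A score of 1.0 means the system order matches the ideal order; averaging the per-query scores over all evaluated reports gives a single summary figure.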
2025
Authors
Shaji, N; Tabassum, S; Ribeiro, RP; Gama, J; Gorgulho, J; Garcia, A; Santana, P;
Publication
APPLIED NETWORK SCIENCE
Abstract
Detecting anomalies in waste transportation networks is vital for uncovering illegal or unsafe activities that can have serious environmental and regulatory consequences. Identifying anomalies in such networks presents a significant challenge due to the limited availability of labeled data and the subtle nature of illicit activities. Moreover, traditional anomaly detection methods relying solely on individual transaction data may overlook deeper, network-level irregularities that arise from complex interactions between entities. This study explores anomaly detection in a waste transport network using unsupervised learning, enhanced by limited supervision and enriched with network structure information. Initially, unsupervised models such as Isolation Forest, K-Means, LOF, and Autoencoders were applied using statistical and graph-based features. These models detected outliers without prior labels. Later, information on a few confirmed anomalous users enabled weak supervision, guiding feature selection through statistical tests such as Kolmogorov-Smirnov and Anderson-Darling. Results show that models trained on a reduced, graph-focused feature set improved anomaly detection, particularly under extreme class imbalance. Isolation Forest notably ranked known anomalies highly. Ego network visualizations supported these findings, demonstrating the value of integrating structural features and limited labels for identifying subtle, relational anomalies.
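The weakly supervised feature selection step described in this abstract can be illustrated with a two-sample Kolmogorov-Smirnov statistic. The sketch below is an assumption about how such a ranking might look, not the authors' pipeline: the feature names, user IDs, values, and the helper `rank_features` are all hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def rank_features(features, flagged):
    """Rank features by how differently they are distributed for flagged users.

    `features` maps a feature name to {user_id: value}; `flagged` is the small
    set of confirmed anomalous users (the weak supervision signal).
    """
    scores = {}
    for name, values in features.items():
        known = [v for uid, v in values.items() if uid in flagged]
        rest = [v for uid, v in values.items() if uid not in flagged]
        scores[name] = ks_statistic(known, rest)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: one graph-based feature and one transaction-level feature.
features = {
    "out_degree": {"u1": 40, "u2": 38, "u3": 3, "u4": 5, "u5": 4, "u6": 2},
    "mean_load": {"u1": 10, "u2": 12, "u3": 11, "u4": 9, "u5": 13, "u6": 10},
}
flagged = {"u1", "u2"}
ranking = rank_features(features, flagged)
```

Features at the top of `ranking` separate the confirmed cases from the rest most sharply and would be the candidates to keep in a reduced feature set. With only a handful of flagged users the statistic is noisy, so it is better read as a ranking heuristic than as a significance test.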
2025
Authors
Pinheiro, AP; Ribeiro, RP;
Publication
CoRR
Abstract
2025
Authors
Paim, AM; Gama, J; Veloso, B; Enembreck, F; Ribeiro, RP;
Publication
40TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING
Abstract
Learning from continuous data streams is a relevant area within machine learning, focusing on the creation and updating of predictive models in real time as new data becomes available for training and prediction. Among the most widely used methods for this type of task, Hoeffding Trees are highly valued for their simplicity and robustness across a variety of applications and are considered the primary choice for generating decision trees in data stream contexts. However, Hoeffding Trees tend to expand continuously as new data is incorporated, resulting in increased processing time and memory consumption, often without providing significant gains in accuracy. In this study, we propose an instance selection scheme that combines different strategies to regularize Hoeffding Trees and their variants, mitigating excessive growth without compromising model accuracy. The method selects misclassified instances and a fraction of correctly classified instances during the training phase. In an extensive experimental evaluation on both real and synthetic data stream datasets, the instance selection scheme demonstrates superior predictive performance compared to the original models (without selection) while using a reduced subset of examples. Additionally, the method achieves relevant improvements in processing time, model complexity, and memory consumption, highlighting the effectiveness of the proposed instance selection scheme.