Publications

Publications by LIAAD

2025

Edge-enabled distributed digital twins with embedded intelligence for smart aquaculture systems

Authors
Costa, D; Rocha, EM; Costa, V; Rocha, MM; Marques, C;

Publication
JOURNAL OF AMBIENT INTELLIGENCE AND SMART ENVIRONMENTS

Abstract
Aquaculture is the world's fastest-growing food production sector, yet it lags behind other industries in adopting upcoming digital technologies. Challenges, such as integrating multimodal data and maintaining reliable network connectivity, have hindered the development of digital twins for monitoring aquaculture systems. This paper addresses these challenges through two main contributions: (i) a novel edge-based architecture for digital twinning that enables distributed, localized monitoring and actuation, reducing dependence on centralized systems and robust networks; and (ii) a three-stage algorithmic approach for mortality monitoring tailored to edge computing environments. This approach enables early detection of rising mortality rates using data fused from diverse sources, including directly monitored environmental parameters (e.g. pH and temperature), and novel optical biosensors that make use of lightweight computer vision and machine learning techniques for the estimation of bacterial concentrations within edge devices. The algorithmic strategy was tested in a real-world recirculating aquaculture system for Solea senegalensis, where bacterial concentration was estimated with an F1-score of 0.83 across five concentration levels using biosensor imagery. Moreover, a multimodal drift detection algorithm successfully identified abnormal data trends aligned with significant changes in input distributions, with preemptive drift signals preceding critical 7-day mortality spikes.

2025

A new parametric information-gain criterion for tree-based machine learning algorithms

Authors
Costa, D; Costa, VV; Rocha, E;

Publication
PEERJ COMPUTER SCIENCE

Abstract
Decision Trees (DTs) remain one of the most important algorithms in machine learning for their simplicity, interpretability, and often satisfactory performance. Furthermore, they are critical foundational components for more performant models such as Random Forests (RFs) and Gradient Boosted Trees. Central to DTs is the splitting process, where data is partitioned according to criteria traditionally based on information-theoretic measures such as Shannon entropy or Gini index. In this article, we propose a novel parametric entropy-based information gain criterion designed to generalize and extend classical entropic measures to improve classification performance in DTs and RFs. We introduce a five-parameter entropy formulation capable of replicating and extending known entropy measures. This new criterion was incorporated into DT and RF classifiers and evaluated on a collection of 18 benchmarking datasets, including both synthetic and real-world data retrieved from publicly available repositories. Performance was assessed using 5-fold cross-validation and optimized via Bayesian hyperparameter search, with weighted F1-score as the primary metric. Compared to splitting criteria based on existing entropy/purity measures (e.g., Gini, Shannon, Rényi, and Tsallis), our method yielded statistically significant improvements in classification performance across most datasets. On multiclass and imbalanced datasets, such as the Wine Quality dataset, F1-score improvements exceeded 40% using RF algorithms. Bayesian signed-rank tests confirmed the robustness of our method, which never underperformed relative to standard approaches. The proposed entropy-based splitting criterion offers a flexible and effective alternative to classical information-theoretic measures, delivering improvements in classification performance.
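
For orientation, the classical baselines named in the abstract (Gini, Shannon, Rényi, and Tsallis) can be written as interchangeable impurity functions inside a generic information-gain computation. The sketch below covers only those textbook measures; the authors' five-parameter formulation is not given in the abstract and is not reproduced here. All function names and default parameter values are illustrative assumptions.

```python
import math
from collections import Counter


def class_probs(labels):
    """Empirical class probabilities of a list of labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def shannon(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def gini(p):
    """Gini impurity."""
    return 1.0 - sum(q * q for q in p)


def renyi(p, alpha=2.0):
    """Renyi entropy in bits; defined for alpha > 0, alpha != 1."""
    return math.log2(sum(q ** alpha for q in p)) / (1.0 - alpha)


def tsallis(p, q_param=2.0):
    """Tsallis entropy; defined for q_param != 1 (q_param = 2 gives Gini)."""
    return (1.0 - sum(q ** q_param for q in p)) / (q_param - 1.0)


def information_gain(parent, children, impurity=shannon):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * impurity(class_probs(ch)) for ch in children)
    return impurity(class_probs(parent)) - weighted


# A perfect split of a balanced binary node.
parent = [0, 0, 1, 1]
children = [[0, 0], [1, 1]]
gain_shannon = information_gain(parent, children)
gain_gini = information_gain(parent, children, impurity=gini)
```

Passing a different `impurity` callable is the only change needed to swap criteria, which is the kind of extension point a parametric entropy would plug into.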

2025

A Smart Tool to Unlock Hidden Insights in Industrial Data by Leveraging EDA, LLM, Conformal Prediction, and AutoML

Authors
Costa, V; Costa, D; Rocha, M;

Publication
Procedia Computer Science

Abstract
Rising competitiveness and client requirements make effective use of high-volume, complex real-time industrial data crucial for faster decision-making. However, this potential is hindered by a lack of smart, user-friendly analytic tools for all collaborators. Despite the proliferation of Machine Learning (ML) tools for data scientists, non-experts struggle with converting data into actionable insights and identifying profitable data science projects. A smart tool is thus proposed, allowing non-experts to perform preliminary data evaluations through profiled analysis pathways that execute predefined sets of Exploratory Data Analysis (EDA) methods and ML operations. To further assist users, the tool relies solely on metadata attributes and textual descriptions of datasets enhanced by interaction with a Large Language Model (LLM). This paper examines profile selection stages, replacing traditional ML methods with Conformal Prediction (CP) techniques. CP identifies multiple potential prospects with statistical confidence and recognizes when correct predictions are impossible. Trials with task-labeled metadata files (derived from publicly available datasets) showed that while classic ML methods had about 80% efficiency, CP techniques improved the selection process, keeping profiling errors below 0.06 with 99% confidence. This approach enables the correct identification (with statistical confidence) of appropriate analysis profiles for data science problems, thus paving the way for more efficient data analysis tools in industrial settings, accessible to users of all skill levels. © 2024 The Authors. Published by Elsevier B.V.
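
The set-valued behaviour described above (several candidate profiles, or none) is characteristic of conformal prediction. As a rough illustration, a standard split-conformal classifier can be sketched as follows. This is a generic textbook variant, not the tool's implementation; the profile labels, probabilities, and function names are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a nonconformity threshold on held-out examples.

    cal_probs: list of dicts mapping label -> predicted probability.
    cal_labels: the true label of each calibration example.
    The nonconformity score is 1 - (probability assigned to the true label).
    """
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[k - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score does not exceed the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


# Hypothetical calibration data: the probability given to each true profile.
cal_probs = [
    {"profile_a": 0.9, "profile_b": 0.1},
    {"profile_a": 0.2, "profile_b": 0.8},
    {"profile_a": 0.6, "profile_b": 0.4},
    {"profile_a": 0.7, "profile_b": 0.3},
]
cal_labels = ["profile_a", "profile_b", "profile_a", "profile_b"]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
confident = prediction_set({"profile_a": 0.7, "profile_b": 0.3}, threshold)
undecided = prediction_set({"profile_a": 0.5, "profile_b": 0.5}, threshold)
```

An empty set, as in `undecided`, signals that no label is sufficiently conforming, which corresponds to recognizing that a reliable prediction is not possible for that input.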

2024

Process mining embeddings: Learning vector representations for Petri nets

Authors
Colonna, JG; Fares, AA; Duarte, M; Sousa, R;

Publication
INTELLIGENT SYSTEMS WITH APPLICATIONS

Abstract
Process Mining offers a powerful framework for uncovering, analyzing, and optimizing real-world business processes. Petri nets provide a versatile means of modeling process behavior. However, traditional methods often struggle to effectively compare complex Petri nets, hindering their potential for process enhancement. To address this challenge, we introduce PetriNet2Vec, an unsupervised methodology inspired by Doc2Vec. This approach converts Petri nets into embedding vectors, facilitating the comparison, clustering, and classification of process models. We validated our approach using the PDC Dataset, comprising 96 diverse Petri net models. The results demonstrate that PetriNet2Vec effectively captures the structural properties of process models, enabling accurate process classification and efficient process retrieval. Specifically, our findings highlight the utility of the learned embeddings in two key downstream tasks: process classification and process retrieval. In process classification, the embeddings allowed for accurate categorization of process models based on their structural properties. In process retrieval, the embeddings enabled efficient retrieval of similar process models using cosine distance. These results demonstrate the potential of PetriNet2Vec to significantly enhance process mining capabilities.
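
The retrieval task mentioned above reduces to ranking stored embedding vectors by cosine distance to a query vector. A minimal sketch, assuming the embeddings have already been learned; the model names and vectors below are made up for illustration and do not come from PetriNet2Vec or the PDC Dataset.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def retrieve(query, embeddings, top_k=3):
    """Names of the top_k stored models closest to the query embedding."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_distance(query, embeddings[name]),
    )
    return ranked[:top_k]


# Hypothetical 2-D embeddings of three process models.
embeddings = {
    "model_1": [1.0, 0.0],
    "model_2": [0.0, 1.0],
    "model_3": [1.0, 1.0],
}
closest = retrieve([1.0, 0.1], embeddings, top_k=2)
```

Cosine distance ignores vector magnitude, so two models are considered similar when their embeddings point in the same direction.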

2024

Optimal gas subset selection for dissolved gas analysis in power transformers

Authors
Pinto, J; Esteves, V; Tavares, S; Sousa, R;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE

Abstract
The power transformer is one of the key components of any electrical grid, and, as such, modern-day industrialization activities require constant usage of the asset. This increases the possibility of failures and can potentially diminish the lifespan of a power transformer. Dissolved gas analysis (DGA) is a technique developed to quantify the existence of hydrocarbon gases in the content of the power transformer oil, which in turn can indicate the presence of faults. Since this process requires a different chemical analysis for each type of gas, the overall cost of the operation increases with the number of gases. Accordingly, a machine learning methodology was defined to meet two simultaneous objectives: identify gas subsets and predict the remaining gases, thus restoring them. Two subsets of equal or smaller size to those used by traditional methods (Duval's triangle, Roger's ratio, IEC table) were identified, while showing potentially superior performance. The models restored the discarded gases, and the restored set was compared with the original set in a variety of validation tasks.

2024

Estimating the Likelihood of Financial Behaviours Using Nearest Neighbors: A case study on market sensitivities

Authors
Mendes Neves, T; Seca, D; Sousa, R; Ribeiro, C; Mendes Moreira, J;

Publication
COMPUTATIONAL ECONOMICS

Abstract
As many automated algorithms find their way into the IT systems of the banking sector, having a way to validate and interpret the results from these algorithms can lead to a substantial reduction in the risks associated with automation. Usually, validating these pricing mechanisms requires human resources to manually analyze and validate large quantities of data. There is a lack of effective methods that analyze the time series and understand if what is currently happening is plausible based on previous data, without information about the variables used to calculate the price of the asset. This paper describes an implementation of a process that allows us to validate many data points automatically. We explore the K-Nearest Neighbors algorithm to find coincident patterns in financial time series, allowing us to detect anomalies, outliers, and data points that do not follow normal behavior. This system allows quicker detection of defective calculations that would otherwise result in the incorrect pricing of financial assets. Furthermore, our method does not require knowledge about the variables used to calculate the time series being analyzed. Our proposal uses pattern matching and can validate more than 58% of instances, substantially improving human risk analysts' efficiency. The proposal is completely transparent, allowing analysts to understand how the algorithm made its decision, increasing the trustworthiness of the method.
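
One simple way to picture nearest-neighbour pattern matching on a time series is sketched below: find the historical windows most similar to the latest window and check whether the newly observed value is consistent with what followed them. This is an illustrative assumption about how such a check could work, not the procedure from the paper; the function name, the Euclidean distance, and the `tolerance` parameter are all hypothetical.

```python
import math


def knn_plausible(history, window, observed_next, k=3, tolerance=0.1):
    """Check a new value against what followed the k most similar past windows.

    history: past values of the series.
    window: the most recent values leading up to the new observation.
    Returns True when observed_next lies within the range of the values that
    followed the k nearest historical windows, widened by `tolerance`.
    """
    w = len(window)
    candidates = []
    for start in range(len(history) - w):
        past = history[start:start + w]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(past, window)))
        candidates.append((dist, history[start + w]))
    if not candidates:
        raise ValueError("history is too short for the given window")
    followers = [nxt for _, nxt in sorted(candidates)[:k]]
    return min(followers) - tolerance <= observed_next <= max(followers) + tolerance


# A toy periodic series: after the pattern (1, 2) the next value has always been 3.
history = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
expected_ok = knn_plausible(history, [1.0, 2.0], observed_next=3.0)
flagged = not knn_plausible(history, [1.0, 2.0], observed_next=10.0)
```

Because the decision rests on a handful of retrieved historical windows, an analyst can inspect exactly which past patterns supported or contradicted a value, which matches the transparency argument made in the abstract.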
