Publications

Publications by Vasco Vieira Costa

2025

Edge-enabled distributed digital twins with embedded intelligence for smart aquaculture systems

Authors
Costa, D; Rocha, EM; Costa, V; Rocha, MM; Marques, C;

Publication
JOURNAL OF AMBIENT INTELLIGENCE AND SMART ENVIRONMENTS

Abstract
Aquaculture is the world's fastest-growing food production sector, yet it lags behind other industries in adopting upcoming digital technologies. Challenges, such as integrating multimodal data and maintaining reliable network connectivity, have hindered the development of digital twins for monitoring aquaculture systems. This paper addresses these challenges through two main contributions: (i) a novel edge-based architecture for digital twinning that enables distributed, localized monitoring and actuation, reducing dependence on centralized systems and robust networks; and (ii) a three-stage algorithmic approach for mortality monitoring tailored to edge computing environments. This approach enables early detection of rising mortality rates using data fused from diverse sources, including directly monitored environmental parameters (e.g. pH and temperature), and novel optical biosensors that make use of lightweight computer vision and machine learning techniques for the estimation of bacterial concentrations within edge devices. The algorithmic strategy was tested in a real-world recirculating aquaculture system for Solea senegalensis, where bacterial concentration was estimated with an F1-score of 0.83 across five concentration levels using biosensor imagery. Moreover, a multimodal drift detection algorithm successfully identified abnormal data trends aligned with significant changes in input distributions, with preemptive drift signals preceding critical 7-day mortality spikes.

2024

Robust mortality prediction on a recirculating aquaculture system

Authors
Costa, V; Rocha, E; Marques, C;

Publication
REVIEW OF SCIENTIFIC INSTRUMENTS

Abstract
Aquaculture presents itself as one of the most rapidly developing means of sustainable production of animal protein to feed ever-growing populations. Recirculating aquaculture systems offer higher control and fewer inconveniences than traditional systems, making them an attractive option for fish production. Although the sector's digitalization is in its early stages, its application should increase profitability while conserving the environment. This paper aims to promote the sector's evolution by assessing parameter importance in mortality with tree-based machine learning models, verifying the method's natural robustness and how it compares to a specially devised one, and at the same time evaluating the concept's relevance in predicting categorical mortality values. In particular, to better understand the aquaculture production process through a systematic data evaluation, an exploration based on real-time data acquisition is needed. Moreover, algorithm robustness is a key ingredient in this application since measurements are greatly affected by errors. This invalidates the application of traditional machine learning methods, where models are sensitive to production data variations and sensor noise. The study found the parameters that play relevant roles in the production phases, such as pH and nitrate concentration. While the obtained predictive metrics are still sub-optimal, further enhancements could be achieved through rigorous analysis of feature engineering, fine-tuning model hyperparameters, and exploring more advanced algorithms. Additionally, incorporating larger and more diverse datasets, refining data pre-processing techniques, and iteratively optimizing the model architecture may contribute to significant improvements in predictive performance. Despite that, the impact costs of using adjusted machine learning metrics are clear, as are the importance of data rounding in pre-processing and directions for improvement regarding data acquisition and transformation.

2025

A new parametric information-gain criterion for tree-based machine learning algorithms

Authors
Costa, D; Costa, VV; Rocha, E;

Publication
PEERJ COMPUTER SCIENCE

Abstract
Decision Trees (DTs) remain one of the most important algorithms in machine learning for their simplicity, interpretability, and often satisfactory performance. Furthermore, they are critical foundational components for more performant models such as Random Forests (RFs) and Gradient Boosted Trees. Central to DTs is the splitting process, where data is partitioned according to criteria traditionally based on information-theoretic measures such as Shannon entropy or Gini index. In this article, we propose a novel parametric entropy-based information gain criterion designed to generalize and extend classical entropic measures to improve classification performance in DTs and RFs. We introduce a five-parameter entropy formulation capable of replicating and extending known entropy measures. This new criterion was incorporated into DT and RF classifiers and evaluated on a collection of 18 benchmarking datasets, including both synthetic and real-world data retrieved from publicly available repositories. Performance was assessed using 5-fold cross-validation and optimized via Bayesian hyperparameter search, with weighted F1-score as the primary metric. Compared to splitting criteria based on existing entropy/purity measures (e.g., Gini, Shannon, Rényi, and Tsallis), our method yielded statistically significant improvements in classification performance across most datasets. On multiclass and imbalanced datasets, such as the Wine Quality dataset, F1-score improvements exceeded 40% using RF algorithms. Bayesian signed-rank tests confirmed the robustness of our method, which never underperformed relative to standard approaches. The proposed entropy-based splitting criterion offers a flexible and effective alternative to classical information-theoretic measures, delivering improvements in classification performance.
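
As a rough illustration of what a parametric splitting criterion looks like, the sketch below computes information gain with the one-parameter Tsallis entropy, one of the existing measures the abstract names as a baseline. It is not the five-parameter formulation proposed in the article, which the abstract does not specify. The function names `tsallis_entropy` and `information_gain` are hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def tsallis_entropy(labels, q):
    """Tsallis entropy of a list of class labels (illustrative helper).

    q -> 1 recovers Shannon entropy (natural log); q = 2 gives the Gini impurity.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [count / n for count in Counter(labels).values()]
    if abs(q - 1.0) < 1e-12:
        # Limit case: Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def information_gain(parent, left, right, q):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        len(left) / n * tsallis_entropy(left, q)
        + len(right) / n * tsallis_entropy(right, q)
    )
    return tsallis_entropy(parent, q) - weighted_children


if __name__ == "__main__":
    parent = [0, 0, 1, 1]
    # A perfect split removes all impurity, so the gain equals the parent entropy.
    print(information_gain(parent, [0, 0], [1, 1], q=2.0))
```

Varying `q` changes how strongly rare classes contribute to the impurity, which is the kind of flexibility a parametric criterion exposes to hyperparameter search.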

2025

A Smart Tool to Unlock Hidden Insights in Industrial Data by Leveraging EDA, LLM, Conformal Prediction, and AutoML

Authors
Costa, V; Costa, D; Rocha, M;

Publication
Procedia Computer Science

Abstract
Rising competitiveness and client requirements make effective use of high-volume, complex real-time industrial data crucial for faster decision-making. However, this potential is hindered by a lack of smart, user-friendly analytic tools for all collaborators. Despite the proliferation of Machine Learning (ML) tools for data scientists, non-experts struggle with converting data into actionable insights and identifying profitable data science projects. A smart tool is thus proposed, allowing non-experts to perform preliminary data evaluations through profiled analysis pathways that execute predefined sets of Exploratory Data Analysis (EDA) methods and ML operations. Further assisting users, the tool relies solely on metadata attributes and textual descriptions of datasets enhanced by interaction with a Large Language Model (LLM). This paper examines profile selection stages, replacing traditional ML methods with Conformal Prediction (CP) techniques. CP identifies multiple potential prospects with statistical confidence and recognizes when correct predictions are impossible. Trials with task-labeled metadata files (derived from publicly available datasets) showed that while classic ML methods had about 80% efficiency, CP techniques improved the selection process, keeping profiling errors below 0.06 with 99% confidence. This approach enables the correct identification (with statistical confidence) of appropriate analysis profiles for data science problems, thus paving the way for more efficient data analysis tools in industrial settings, accessible to users of all skill levels. © 2024 The Authors. Published by Elsevier B.V.
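
To make the set-valued behaviour of Conformal Prediction concrete, the sketch below shows generic split conformal classification. It is a minimal illustration under stated assumptions, not the paper's pipeline: the nonconformity score (one minus the predicted probability of the true class), the function names, and the toy probabilities are all hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a held-out calibration set (split conformal).

    cal_probs: per-sample lists of class probabilities from any classifier.
    cal_labels: true class index for each calibration sample.
    alpha: target error rate, e.g. 0.1 for 90% coverage.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration samples for this alpha: every class must be kept.
        return float("inf")
    return scores[k - 1]


def prediction_set(probs, threshold):
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]


if __name__ == "__main__":
    cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
    cal_labels = [0, 0, 1]
    threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
    print(prediction_set([0.8, 0.2], threshold))  # confident input
    print(prediction_set([0.5, 0.5], threshold))  # ambiguous input
```

A set with several classes signals that more than one candidate is plausible, while an empty set flags an input unlike anything seen during calibration.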
