Details
Name
Luís Cavique
Role
External Research Collaborator
Since
12th March 2025
Nationality
Portugal
Centre
Human-Centered Computing and Information Science
Contacts
+351222094000
luis.cavique@inesctec.pt
2026
Authors
António, F; Cavique, L;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2025, PT I
Abstract
Sales forecasting in the presence of missing data poses significant challenges, particularly for short time series where limited observations amplify the impact of incomplete records. This study analyzes a real-world transactional dataset (2021-2024) to predict quantities and prices for 2025. We classify missingness patterns and mechanisms (MCAR, MAR, MNAR) to inform the selection of imputation strategies. We evaluate techniques including MICE, Mean, KNN, and Linear Regression under simulated missingness rates, with KNN emerging as the most robust for the MAR mechanism. Regarding very short-term series predictions, the naive forecast Max2 (maximum of the last two observed values) outperformed moving averages. The results highlight the importance of mechanism-aware imputation and domain-tailored forecasting in sparse datasets. This work presents a practical framework for businesses to effectively utilize incomplete sales data.
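The Max2 rule described in the abstract can be illustrated with a minimal sketch. The function names, the use of `None` for missing records, and the choice to skip missing entries when looking for the "last two observed values" are assumptions made for illustration and are not taken from the paper.

```python
from typing import Optional, Sequence


def max2_forecast(series: Sequence[Optional[float]]) -> Optional[float]:
    """Naive forecast: maximum of the last two observed (non-missing) values.

    Assumption: missing records are encoded as None and are skipped.
    """
    observed = [v for v in series if v is not None]
    if not observed:
        return None  # nothing observed, so no forecast is possible
    return max(observed[-2:])


def moving_average_forecast(series: Sequence[Optional[float]],
                            window: int = 3) -> Optional[float]:
    """Baseline: mean of the last `window` observed values (window is hypothetical)."""
    observed = [v for v in series if v is not None]
    if not observed:
        return None
    tail = observed[-window:]
    return sum(tail) / len(tail)


# Hypothetical yearly quantities for one product, with one missing year.
sales = [120.0, None, 95.0, 110.0]
next_year_max2 = max2_forecast(sales)
next_year_ma = moving_average_forecast(sales)
```

With so few points per series, a rule of this kind has no parameters to fit, which may be part of why it compares well against moving averages on very short series.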
2026
Authors
Alcalde, DD; Bugarim, D; Coelho, T; Almeida, E; Silva, C; Cavique, L; Dias Ferreira, C;
Publication
DATA IN BRIEF
Abstract
The dataset reports an up-to-date overview of the selective biowaste collection with a focus on food waste and organic kitchen waste across 308 municipalities in Portugal, to assess the compliance with the EU Waste Framework Directive that made biowaste collection mandatory from 1st January 2024. Data were collected through a structured survey sent to the totality of the municipalities, complemented by systematic research in secondary official sources such as municipal websites, reports and statistical data. The questionnaire covered aspects such as coverage, collection models (nearby bring points, door-to-door, co-collection), sector-specific deployment (household collection, non-domestic collection), operational characteristics, and performance indicators (capture rates, cost per tonne). The dataset was structured and validated through cross-checking the multiple sources assessed, prioritising direct municipal questionnaire responses. It includes disaggregated data at the municipality level, including detailed information on the characteristics and efficiency of the initiatives, when available. The database allows the cross-comparison across Portuguese regions and potentially with other international systems, in terms of biowaste collection strategies with focus on food waste and organic kitchen waste. Municipalities in Portugal have been carrying out pilot experiences within their territories, but there is no systematic assessment of what has been carried out nor the results obtained. Given the limited available data, this dataset provides a valuable resource for policy design and further research on biowaste management initiatives to further assess their efficiency and adaptability to different municipal realities at a national and even European level. (c) 2025 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
2025
Authors
Vasconcelos, MO; Cavique, L;
Publication
INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS
Abstract
The growing use of machine learning for integrity assessments in public administration has intensified interest in understanding how algorithms can detect corruption risk, a topic of increasing relevance in the context of rising demands for transparency. Previous research on fraud detection often overlooks the dual challenge of extreme class imbalance and the need for model explainability. This study addresses both issues by combining data-level and algorithm-level techniques in a real-world dataset from Brazil's Federal District, where there is one corruption case for every 707 non-corruption cases (a ratio of 1:707). Data engineering was essential, encompassing gathering, cleaning, transformation, and dimensionality reduction to enhance model performance and interpretability. Among the tested models, weighted logistic regression stood out, achieving the best AUC (0.692). To increase transparency, we employed SHapley Additive exPlanations, enabling both global and local interpretability of predictions. The analysis identified strong predictors of corruption risk, such as business ownership, political candidacy, and frequent job function changes. This work provides a replicable pipeline that integrates imbalanced learning and explainable AI, offering valuable contributions to risk management and decision-making in the public sector.
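To give a sense of what class weighting means at a 1:707 ratio, the sketch below computes inverse-frequency class weights and a weighted log loss in plain Python. The abstract does not say how the weights were chosen, so the "balanced" heuristic used here (weight = n_samples / (n_classes * class_count)) is an assumption, as are all function names.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_log_loss(labels: Sequence[int], probs: Sequence[float],
                      weights: Dict[int, float]) -> float:
    """Weighted binary cross-entropy; probs are P(y=1), strictly inside (0, 1)."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# Toy labels reproducing the 1:707 ratio quoted in the abstract.
labels = [1] + [0] * 707
weights = balanced_class_weights(labels)
```

Under this heuristic the single positive case carries a weight of 354, so both classes contribute equally to the loss in aggregate, and a model can no longer reach a low loss by predicting the majority class everywhere.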
2025
Authors
Pinheiro, P; Cavique, L;
Publication
Decision Analytics Journal
Abstract
In uplift modeling, the goal is to identify high-value customers based on persuadable customers, those who make a purchase only if contacted. To achieve this, uplift modeling combines machine learning techniques with causal inference, allowing businesses to refine their customer targeting strategies and focus efforts where they are most profitable. This study proposes a practical and reproducible two-phase procedure for identifying high-value customers. In the first phase, customers are segmented using decision trees, which offer a transparent and data-driven approach to grouping individuals with similar characteristics. This segmentation lays the groundwork for a meaningful interpretation of customer behavior. In the second phase, uplift is calculated for each customer segment by comparing the outcomes of the treatment and control groups. This enables the identification of customer groups with the highest uplift. A real-world use case further illustrates the value and applicability of the proposed method. To validate model performance, the procedure employs established metrics such as the Qini index and Cohen's kappa, which provide insights into both the effectiveness and reliability of the uplift estimates. This work presents a decoupled procedure for uplift modeling that leverages well-established libraries, fostering transparency and a clear understanding of the analytical process. A key contribution to uplift modeling and causal inference is the use of decision trees for stratification, which enables the creation of meaningful segments and their evaluation through the average treatment effect. By integrating theory with practical implementation, this work offers a comprehensive framework for uplift modeling that bridges academic rigor and business usability. © 2025 Elsevier B.V., All rights reserved.
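The second phase described above, computing uplift per segment from treatment and control outcomes, can be sketched as follows. The segments are assumed to be already assigned (in the paper they come from decision trees); the record layout, function name, and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, bool, bool]  # (segment, treated, converted)


def uplift_by_segment(records: Iterable[Record]) -> Dict[str, float]:
    """Uplift per segment = conversion rate (treated) - conversion rate (control)."""
    # For each segment keep [conversions, total] for treated (True) and control (False).
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, converted in records:
        cell = counts[segment][treated]
        cell[0] += int(converted)
        cell[1] += 1

    uplift = {}
    for segment, groups in counts.items():
        (t_conv, t_n), (c_conv, c_n) = groups[True], groups[False]
        if t_n == 0 or c_n == 0:
            continue  # uplift is undefined without both groups
        uplift[segment] = t_conv / t_n - c_conv / c_n
    return uplift


# Toy data: segment A responds to contact, segment B does not.
records = (
    [("A", True, True)] * 3 + [("A", True, False)] * 1
    + [("A", False, True)] * 1 + [("A", False, False)] * 3
    + [("B", True, True)] * 1 + [("B", True, False)] * 3
    + [("B", False, True)] * 1 + [("B", False, False)] * 3
)
segment_uplift = uplift_by_segment(records)
best_segment = max(segment_uplift, key=segment_uplift.get)
```

The per-segment difference in conversion rates is a simple estimate of the average treatment effect within each stratum, which is valid only when treatment was assigned independently of the outcome within the segment.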
2025
Authors
Vasconcelos, M; Cavique, L;
Publication
EXPERT SYSTEMS WITH APPLICATIONS
Abstract
Imbalanced datasets present a challenge in machine learning, especially in binary classification scenarios where one class significantly outweighs the other. This imbalance often leads to models favoring the majority class, resulting in inadequate predictions for the minority class, specifically in false negatives. In response to this issue, this work introduces the MinFNR ensemble algorithm, designed to minimize False Negative Rates (FNR) in imbalanced datasets. The new approach strategically combines data-level, algorithmic-level, and hybrid-level approaches to enhance overall predictive capabilities while minimizing computational resources using the Set Covering Problem (SCP) formulation. Through a comprehensive evaluation of diverse datasets, MinFNR consistently outperforms individual algorithms, showing its potential for applications where the cost of false negatives is substantial, such as fraud detection and medical diagnosis. This work also contributes to ongoing efforts to improve the reliability and effectiveness of machine learning algorithms in real imbalanced scenarios.
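The abstract does not spell out the Set Covering Problem formulation, so the sketch below shows only one plausible reading: each candidate model "covers" the minority-class instances it classifies correctly, and a small subset of models is chosen so that together they cover as many of those instances as possible. The greedy heuristic, the model names, and the data are all assumptions for illustration, not the MinFNR algorithm itself.

```python
from typing import Dict, Hashable, List, Set, Tuple


def greedy_set_cover(universe: Set[Hashable],
                     subsets: Dict[str, Set[Hashable]]
                     ) -> Tuple[List[str], Set[Hashable]]:
    """Greedy heuristic: repeatedly pick the subset covering the most uncovered items.

    Returns the chosen subset names and any items that no subset covers.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered and subsets:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            break  # remaining items cannot be covered by any subset
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical: ids of positive (minority) instances each model detects correctly.
positives = {1, 2, 3, 4, 5, 6}
detected_by = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5},
    "model_c": {5, 6},
    "model_d": {1, 6},
}
selected, missed = greedy_set_cover(positives, detected_by)
# FNR of an ensemble that flags an instance if any selected model flags it.
ensemble_fnr = len(missed) / len(positives)
```

Under this reading, covering every positive instance with few models corresponds to a low false negative rate at low computational cost, although an "any model flags it" rule would also tend to raise false positives.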