Publications - INESC TEC


Publications by Luís Cavique

2020

A bi-objective procedure to deliver actionable knowledge in sport services

Authors
Pinheiro, P; Cavique, L;

Publication
EXPERT SYSTEMS

Abstract
Increasing customer retention in gyms and health clubs is a challenge that requires concrete and personalized actions. Traditional data mining studies have focused essentially on predictive analytics, neglecting the business domain. This work presents an actionable knowledge discovery system built on a three-step pipeline: data collection, predictive modelling and retention interventions. In the first step, it extracts and transforms existing real data from the databases of the sports facilities. In the second step, predictive models are applied to identify the user profiles most susceptible to dropout, with actionable withdrawal rules based on actionable attributes. In the third step, based on this actionable knowledge, some values of the actionable attributes are changed in order to increase retention. Scenarios are simulated with test and control groups, and the business utility and associated cost of each are measured. A bi-objective study is then used to choose the most efficient scenarios.
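
The bi-objective selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "efficient" means Pareto-efficient (higher business utility, lower cost), which the abstract does not state explicitly; the `Scenario` type, the function names and all numbers are hypothetical.

```python
from typing import List, NamedTuple


class Scenario(NamedTuple):
    name: str
    utility: float  # business utility, higher is better (hypothetical values)
    cost: float     # associated cost, lower is better (hypothetical values)


def dominates(a: Scenario, b: Scenario) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a.utility >= b.utility and a.cost <= b.cost
    strictly_better = a.utility > b.utility or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(scenarios: List[Scenario]) -> List[Scenario]:
    """Keep only the scenarios that no other scenario dominates."""
    return [s for s in scenarios if not any(dominates(o, s) for o in scenarios)]


# Illustrative scenarios only; they are not taken from the paper.
scenarios = [
    Scenario("A", utility=10.0, cost=5.0),
    Scenario("B", utility=12.0, cost=9.0),  # dominated by D (same utility, higher cost)
    Scenario("C", utility=9.0, cost=6.0),   # dominated by A (lower utility, higher cost)
    Scenario("D", utility=12.0, cost=7.0),
]
efficient = [s.name for s in pareto_front(scenarios)]
print(efficient)
```

Scenarios A and D survive because neither is better than the other on both objectives at once; choosing between them is a business trade-off.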


2020

Supply-Demand Matrix: A Process-Oriented Approach for Data Warehouses with Constellation Schemas

Authors
Cavique, L; Cavique, M; Santos, JMA;

Publication
TRENDS AND INNOVATIONS IN INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 1

Abstract
The star schema is a well-established model in data warehouses. However, the increasing number of star schemas, which together form large constellation schemas, adds new challenges for organizations. In this document, we contribute to the technical architecture of data warehouses with constellation schemas using an extension of the bus matrix. The proposed supply-demand matrix details the raw data from the original databases, describes the constellation schemas with their different dimensions and establishes the information demand requirements. © 2020, The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG.


2020

Addressing Low Dimensionality Feature Subset Selection: ReliefF(-k) or Extended Correlation-Based Feature Selection(eCFS)?

Authors
Tallon Ballesteros, AJ; Cavique, L; Fong, S;

Publication
14TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING MODELS IN INDUSTRIAL AND ENVIRONMENTAL APPLICATIONS (SOCO 2019)

Abstract
This paper tackles problems where attribute selection not only chooses a few features but also yields lower classification accuracy than the full attribute set. Correlation-based feature selection (CFS) has been set as the baseline attribute subset selector due to its popularity and high performance. Around a hundred data sets have been collected and submitted to CFS; the problems that simultaneously fulfil two conditions, (a) fewer than six selected attributes and (b) less than forty per cent of the attributes selected, have then been tested in two directions. Firstly, in the scope of data selection at the feature level, an advanced contemporary approach has been applied, as well as some options proposed in a prior work. Secondly, the pre-processed and initial problems have been tested with some sturdy classifiers. Moreover, this work introduces a new taxonomy of feature selection according to the solution type and the way it is computed. The test bed comprises seven problems characterized by a low dimensionality after the CFS application: three of them report a single selected attribute, another one has two extracted features, and the three remaining data sets retain four or five attributes. Additionally, the initial feature set ranges between six and twenty-nine attributes and the complexity of the problems, in terms of classes, ranges between two and twenty-one, with averages of sixteen and around five, respectively. The contribution concludes that the advanced procedure (extended CFS) is suitable for problems where only one or two attributes are selected by CFS; for data sets with more than two selected features the baseline method is preferable to the advanced one, although the considered feature ranking method achieved intermediate results.
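
The two screening conditions (a) and (b) stated in the abstract can be expressed as a small filter. This is only a sketch of those two thresholds: the function name and the example attribute counts are hypothetical, and the CFS step that produces the selected-attribute counts is not shown.

```python
def is_low_dimensional(n_selected: int, n_total: int) -> bool:
    """Check conditions (a) and (b) from the abstract.

    (a) fewer than six attributes selected by CFS
    (b) less than forty per cent of the attributes selected
    """
    return n_selected < 6 and (n_selected / n_total) < 0.40


# Hypothetical (selected, total) attribute counts, for illustration only.
candidates = {
    "one_of_six": (1, 6),             # 16.7% selected, passes both conditions
    "five_of_twenty_nine": (5, 29),   # 17.2% selected, passes both conditions
    "five_of_ten": (5, 10),           # 50% selected, fails condition (b)
    "seven_of_twenty_nine": (7, 29),  # seven selected, fails condition (a)
}
kept = [name for name, (sel, tot) in candidates.items() if is_low_dimensional(sel, tot)]
print(kept)
```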


2021

Multi-Attribute Forecast of the Price in the Iberian Electricity Market

Authors
Peres, G; Tallón Ballesteros, AJ; Cavique, L;

Publication
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract
Electricity has been acquiring a more significant presence in our lives, and it is estimated that the future will be increasingly electric. Nowadays, we have access to enormous amounts of data that add little value if they cannot support decision-making or help plan systems correctly and in advance. Forecasts are vital tools to support decision-making. We believe it is possible to use open data available on the Internet to make electricity price forecasts that decision-makers in the sector can use. In this work, we study the multi-attribute hourly forecast of the electricity price in MIBEL (Iberian electricity market) for the 24 h of the following day, using open data. The multi-attribute predictions were produced with the TIM (‘Tangent Information Modeler’) tool, which has AutoML (‘Auto Machine Learning’) capabilities. The TOPSIS (‘technique for order of preference by similarity to ideal solution’) decision support technique was used to analyze the results. © 2021, Springer Nature Switzerland AG.
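
For readers unfamiliar with TOPSIS, the following is a minimal sketch of the standard technique with vector normalization. The criteria (two forecast error measures), the weights and all numbers are hypothetical; the abstract does not say which criteria or weights the authors used.

```python
import math
from typing import List


def topsis(matrix: List[List[float]], weights: List[float], benefit: List[bool]) -> List[float]:
    """Return the closeness score of each alternative (row); higher is better.

    benefit[j] is True when criterion j should be maximized, False when minimized.
    Assumes no criterion column is all zeros and the alternatives are not all identical.
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal and anti-ideal points, criterion by criterion.
    cols = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    # Closeness = distance to anti-ideal / (distance to ideal + distance to anti-ideal).
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)
        d_neg = math.dist(row, anti)
        scores.append(d_neg / (d_pos + d_neg))
    return scores


# Hypothetical forecast models scored on two error measures (lower is better).
errors = [
    [5.0, 7.0],  # model 0
    [4.0, 6.0],  # model 1: best on both criteria
    [6.0, 9.0],  # model 2: worst on both criteria
]
scores = topsis(errors, weights=[0.5, 0.5], benefit=[False, False])
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Because model 1 coincides with the ideal point and model 2 with the anti-ideal point, their closeness scores are the extremes of the scale, and model 0 falls in between.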


2021

Imbalanced Learning in Assessing the Risk of Corruption in Public Administration

Authors
Vasconcelos, MO; Chaim, RM; Cavique, L;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2021)

Abstract
This research aims to identify corruption among civil servants in the Federal District, Brazilian Public Administration. For this purpose, a predictive model was created by integrating data from eight different systems and applying logistic regression to real datasets that, by their nature, contain a low percentage of the examples of interest for pattern identification in machine learning, a situation known as class imbalance. In this study, the class imbalance was considered extreme, at a ratio of 1:707 or, in percentage terms, 0.14% of the class of interest in the population. Two approaches were used: balancing with resampling techniques, using the synthetic minority oversampling technique (SMOTE), and applying algorithms with specific parameterization to obtain the desired patterns of the minority class without generating bias from the dominant class. The best modeling result was obtained with the second approach, which yielded an area under the ROC curve of around 0.69. Based on sixty-eight features, the respective coefficients that correspond to the risk factors for corruption were found. A subset of twenty features is discussed in order to find practical utility after the discovery process.
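
The imbalance figures quoted in the abstract can be checked with a short calculation. The sketch below reads 1:707 as one minority example per 707 majority examples (an assumption) and also computes inverse-frequency class weights, one common form of imbalance-aware parameterization; the abstract does not say that the authors used this particular scheme, and the class labels are hypothetical.

```python
from typing import Dict


def minority_share(minority: int, majority: int) -> float:
    """Fraction of the population that belongs to the minority class."""
    return minority / (minority + majority)


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical labels; the 1:707 ratio is the one quoted in the abstract.
counts = {"majority": 707, "minority": 1}
share = minority_share(counts["minority"], counts["majority"])
weights = balanced_class_weights(counts)
print(f"{share:.2%}", weights)
```

With these weights, each minority example counts 707 times as much as a majority example in a weighted loss, which offsets the imbalance without generating synthetic samples.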


2021

Regular sports services: Dataset of demographic, frequency and service level agreement

Authors
Pinheiro, P; Cavique, L;

Publication
DATA IN BRIEF

Abstract
This article describes a dataset of the services acquired by users during the period in which they are active in a sports facility, as well as their behavior in terms of how often they attend the facility and the type of classes they prefer. Each observation in the dataset corresponds to one user and includes subscription and frequency features. Data were collected between June 1st 2014 and October 31st 2019 from the database of an ERP solution operating in a sports facility in Lisbon, Portugal. The data were extracted from this database, transformed and loaded into the dataset. The dataset, which contains real data, can be useful for research in areas such as customer retention, machine learning, marketing, actionable knowledge and others. Although we present real data from users of a sports facility, the attributes that could identify the users were removed in order to comply with the GDPR legislation, making the data anonymized. © 2021 The Author(s). Published by Elsevier Inc.
