Publications

Publications by Vítor Manuel Cerqueira

2017

A Comparative Study of Performance Estimation Methods for Time Series Forecasting

Authors
Cerqueira, V; Torgo, L; Smailovic, J; Mozetic, I;

Publication
2017 IEEE INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA)

Abstract
Performance estimation denotes the task of estimating the loss that a predictive model will incur on unseen data. These procedures are part of the pipeline in every machine learning task and are used for assessing the overall generalisation ability of models. In this paper we address the application of these methods to time series forecasting tasks. For independent and identically distributed data the most common approach is cross-validation. However, the dependency among observations in time series raises some caveats about the most appropriate way to estimate performance in these datasets, and currently there is no settled way to do so. We compare different variants of cross-validation and different variants of out-of-sample approaches using two case studies: one with 53 real-world time series and another with three synthetic time series. Results show noticeable differences in the performance estimation methods in the two scenarios. In particular, empirical experiments suggest that cross-validation approaches can be applied to stationary synthetic time series. However, in real-world scenarios the most accurate estimates are produced by the out-of-sample methods, which preserve the temporal order of observations.
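The contrast drawn in this abstract between cross-validation and out-of-sample estimation can be illustrated with a minimal sketch. The function names `holdout_split` and `kfold_splits`, and the parameters `test_size`, `k` and `seed`, are hypothetical and chosen for illustration only; they are not taken from the paper, which compares several variants of each family.

```python
import random


def holdout_split(n_obs, test_size):
    """Out-of-sample split: train on the earliest observations, test on the last ones."""
    cut = n_obs - test_size
    return list(range(cut)), list(range(cut, n_obs))


def kfold_splits(n_obs, k=5, seed=0):
    """Standard K-fold cross-validation on shuffled indices (temporal order is ignored)."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f_i, f in enumerate(folds) if f_i != i for j in f)
        yield train, test


train, test = holdout_split(n_obs=10, test_size=3)
# Every training index precedes every test index, so temporal order is preserved.
print(train, test)

for cv_train, cv_test in kfold_splits(n_obs=10, k=5):
    # With shuffling, training folds generally contain observations from the "future".
    print(cv_train, cv_test)
```

In the hold-out split the model never sees observations that occur after the test period, whereas shuffled K-fold mixes past and future observations in the training folds. This is the property the abstract refers to when it says out-of-sample methods preserve the temporal order of observations.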


2016

Automated Setting of Bus Schedule Coverage Using Unsupervised Machine Learning

Authors
Khiari, J; Matias, LM; Cerqueira, V; Cats, O;

Publication
Advances in Knowledge Discovery and Data Mining - 20th Pacific-Asia Conference, PAKDD 2016, Auckland, New Zealand, April 19-22, 2016, Proceedings, Part I

Abstract
The efficiency of Public Transportation (PT) Networks is a major goal of any urban area authority. Advances in both location and communication devices have drastically increased the availability of the data generated by their operations. Adequate Machine Learning methods can thus be applied to identify patterns useful to improve the Schedule Plan. In this paper, the authors propose a fully automated learning framework to determine the best Schedule Coverage to be assigned to a given PT network based on Automatic Vehicle Location (AVL) and Automatic Passenger Counting (APC) data. We formulate this problem as a clustering one, where the best number of clusters is selected through an ad-hoc metric. This metric takes into account multiple domain constraints, computed using Sequence Mining and Probabilistic Reasoning. A case study from a large operator in Sweden was selected to validate our methodology. Experimental results suggest necessary changes to the Schedule Coverage. Moreover, an impact study was conducted through a large-scale simulation over the affected time period. Its results uncovered potential improvements of the schedule reliability on a large scale. © Springer International Publishing Switzerland 2016.


2016

CJAMmer - traffic JAM Cause Prediction using Boosted Trees

Authors
Matias, LM; Cerqueira, V;

Publication
19th IEEE International Conference on Intelligent Transportation Systems, ITSC 2016, Rio de Janeiro, Brazil, November 1-4, 2016

Abstract
A traffic incident is defined as an event which provokes a disruption of the normal (free) flow condition of a highway. Such incidents may be caused by a recurrent excessive demand or, alternatively, by a series of possible stochastic occurrences which may suddenly reduce the road capacity (e.g. car accidents, extreme weather changes). This paper proposes a novel binary supervised learning method to classify congestion predictions regarding their causes - CJAMmer. It leverages heterogeneous and ubiquitous data sources - such as weather, flow counts and traffic incident event logs - to generalize decision models able to understand the road congestion nature. CJAMmer settles on boosted decision trees using the well-known C4.5, as well as a straightforward feature generation process. A real world experiment was used to compare this method against other state-of-the-art classifiers. The results uncovered the high potential impact of this methodology on industrial scale traffic control systems. © 2016 IEEE.


2018

How to evaluate sentiment classifiers for Twitter time-ordered data?

Authors
Mozetic, I; Torgo, L; Cerqueira, V; Smailovic, J;

Publication
PLOS ONE

Abstract
Social media are becoming an increasingly important source of information about the public mood regarding issues such as elections, Brexit, stock market, etc. In this paper we focus on sentiment classification of Twitter data. Construction of sentiment classifiers is a standard text mining task, but here we address the question of how to properly evaluate them as there is no settled way to do so. Sentiment classes are ordered and unbalanced, and Twitter produces a stream of time-ordered data. The problem we address concerns the procedures used to obtain reliable estimates of performance measures, and whether the temporal ordering of the training and test data matters. We collected a large set of 1.5 million tweets in 13 European languages. We created 138 sentiment models and out-of-sample datasets, which are used as a gold standard for evaluations. The corresponding 138 in-sample data-sets are used to empirically compare six different estimation procedures: three variants of cross-validation, and three variants of sequential validation (where test set always follows the training set). We find no significant difference between the best cross-validation and sequential validation. However, we observe that all cross-validation variants tend to overestimate the performance, while the sequential methods tend to underestimate it. Standard cross-validation with random selection of examples is significantly worse than the blocked cross-validation, and should not be used to evaluate classifiers in time-ordered data scenarios.
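The two families of procedures compared in this abstract, blocked cross-validation and sequential validation, can be sketched as index generators. This is a minimal illustration under stated assumptions: the function names `blocked_cv_splits` and `sequential_splits` are hypothetical, and the expanding-window scheme below is only one possible sequential variant, not necessarily one of the three evaluated in the paper.

```python
def block_bounds(n_obs, k):
    """Boundaries of k contiguous blocks covering indices 0..n_obs-1."""
    return [i * n_obs // k for i in range(k + 1)]


def blocked_cv_splits(n_obs, k):
    """Blocked cross-validation: no shuffling; each contiguous block is the test set once."""
    bounds = block_bounds(n_obs, k)
    for i in range(k):
        test = list(range(bounds[i], bounds[i + 1]))
        # Training data may lie both before and after the test block.
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n_obs))
        yield train, test


def sequential_splits(n_obs, k):
    """Sequential validation (expanding window): the test block always follows the training data."""
    bounds = block_bounds(n_obs, k)
    for i in range(1, k):
        train = list(range(0, bounds[i]))
        test = list(range(bounds[i], bounds[i + 1]))
        yield train, test


for train, test in blocked_cv_splits(n_obs=12, k=4):
    print("blocked   ", train, test)

for train, test in sequential_splits(n_obs=12, k=4):
    print("sequential", train, test)
```

Blocked cross-validation keeps neighbouring observations together but still trains on data that follows the test block, while the sequential scheme only ever trains on the past. The sequential scheme yields one fewer split because the first block has no preceding training data.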


2019

Constructive Aggregation and Its Application to Forecasting with Dynamic Ensembles

Authors
Cerqueira, V; Pinto, F; Torgo, L; Soares, C; Moniz, N;

Publication
MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2018, PT I

Abstract
While the predictive advantage of ensemble methods is nowadays widely accepted, the most appropriate way of estimating the weights of each individual model remains an open research question. Meanwhile, several studies report that combining different ensemble approaches leads to improvements in performance, due to a better trade-off between the diversity and the error of the individual models in the ensemble. We contribute to this research line by proposing an aggregation framework for a set of independently created forecasting models, i.e. heterogeneous ensembles. The general idea is to, instead of directly aggregating these models, first rearrange them into different subsets, creating a new set of combined models which is then aggregated into a final decision. We present this idea as constructive aggregation, and apply it to time series forecasting problems. Results from empirical experiments show that applying constructive aggregation to state of the art dynamic aggregation methods provides a consistent advantage. Constructive aggregation is publicly available in a software package. Data related to this paper are available at: https://github.com/vcerqueira/timeseriesdata. Code related to this paper is available at: https://github.com/vcerqueira/tsensembler.
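The two-stage idea described in this abstract (rearrange base models into subsets, combine each subset, then aggregate the combined models) can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the function name `constructive_aggregate`, the use of all fixed-size subsets, simple averaging within a subset, and inverse-error weighting in the final step. None of these choices is taken from the paper or from the tsensembler package.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def constructive_aggregate(preds, recent_errors, subset_size=2):
    """Two-stage aggregation of base-model forecasts for a single time step.

    preds:         dict mapping model name -> forecast (hypothetical input format)
    recent_errors: dict mapping model name -> recent absolute error (hypothetical)
    """
    # Stage 1: rearrange the base models into subsets (assumption: all subsets of a fixed size).
    subsets = list(combinations(sorted(preds), subset_size))

    # Stage 2: turn each subset into one combined model (assumption: simple average).
    combined_preds = [mean([preds[m] for m in s]) for s in subsets]
    # Crude proxy for each combined model's error: the mean error of its members.
    combined_errors = [mean([recent_errors[m] for m in s]) for s in subsets]

    # Stage 3: aggregate the combined models (assumption: inverse-error weights).
    weights = [1.0 / (e + 1e-8) for e in combined_errors]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, combined_preds)) / total


forecast = constructive_aggregate(
    preds={"arima": 10.0, "ets": 12.0, "tree": 11.0},
    recent_errors={"arima": 0.5, "ets": 2.0, "tree": 1.0},
)
print(forecast)
```

The final forecast is a convex combination of the subset averages, so it always lies between the smallest and largest base forecast. Subsets whose members had lower recent error receive more weight.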


2018

SMOTEBoost for Regression: Improving the Prediction of Extreme Values

Authors
Moniz, N; Ribeiro, RP; Cerqueira, V; Chawla, N;

Publication
2018 IEEE 5TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA)

Abstract
Supervised learning with imbalanced domains is one of the biggest challenges in machine learning. Such tasks differ from standard learning tasks by assuming a skewed distribution of target variables, and user domain preference towards under-represented cases. Most research has focused on imbalanced classification tasks, where a wide range of solutions has been tested. Still, little work has been done concerning imbalanced regression tasks. In this paper, we propose an adaptation of the SMOTEBoost approach for the problem of imbalanced regression. Originally designed for classification tasks, it combines boosting methods and the SMOTE resampling strategy. We present four variants of SMOTEBoost and provide an experimental evaluation using 30 datasets with an extensive analysis of results in order to assess the ability of SMOTEBoost methods in predicting extreme target values, and their predictive trade-off concerning baseline boosting methods. SMOTEBoost is publicly available in a software package.
