Publications

Publications by LIAAD

2021

An analysis of Monte Carlo simulations for forecasting software projects

Authors
Miranda, P; Faria, JP; Correia, FF; Fares, A; Graça, R; Moreira, JM;

Publication
SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, Republic of Korea, March 22-26, 2021

Abstract
Forecasts of the effort or delivery date can play an important role in managing software projects, but the estimates provided by development teams are often inaccurate and time-consuming to produce. This is not surprising given the uncertainty that underlies this activity. This work studies the use of Monte Carlo simulations for generating forecasts based on project historical data. We have designed and run experiments comparing these forecasts against what happened in practice and against estimates provided by developers, when available. Comparisons were made based on the mean magnitude of relative error (MMRE). We also analyzed how the forecasting accuracy varies with the amount of work to be forecasted and the amount of historical data used. To minimize the requirements on input data, delivery date forecasts for a set of user stories were computed based on takt time of past stories (time elapsed between the completion of consecutive stories); effort forecasts were computed based on full-time equivalent (FTE) hours allocated to the implementation of past stories. The MMRE of delivery date forecasting was 32% in a set of 10 runs (for different projects) of Monte Carlo simulation based on takt time. The MMRE of effort forecasting was 20% in a set of 5 runs of Monte Carlo simulation based on FTE allocation, much smaller than the MMRE of 134% of developers' estimates. A better forecasting accuracy was obtained when the number of historical data points was 20 or higher. These results suggest that Monte Carlo simulations may be used in practice for delivery date and effort forecasting in agile projects, after a few initial sprints. © 2021 ACM.
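
As an illustration of the idea in this abstract, the following is a minimal sketch of a takt-time Monte Carlo forecast and of the MMRE metric. It is not the authors' implementation: the function names, the 85th-percentile cut-off, the number of runs, and the sample data are all assumptions made for the example.

```python
import random

def forecast_days(takt_times, n_stories, n_runs=10_000, percentile=0.85, seed=0):
    """Monte Carlo forecast of the days needed to complete n_stories.

    Each run resamples (with replacement) one historical takt time per
    remaining story and sums them. The forecast is a percentile of the
    simulated totals (the 0.85 default is an assumption, not from the paper).
    """
    rng = random.Random(seed)
    totals = sorted(
        sum(rng.choice(takt_times) for _ in range(n_stories))
        for _ in range(n_runs)
    )
    index = min(int(percentile * n_runs), n_runs - 1)
    return totals[index]

def mmre(actuals, forecasts):
    """Mean magnitude of relative error: mean of |actual - forecast| / actual."""
    return sum(abs(a - f) / a for a, f in zip(actuals, forecasts)) / len(actuals)

# Hypothetical takt times: days elapsed between consecutive story completions.
history = [1.0, 2.5, 0.5, 3.0, 1.5, 2.0, 1.0, 4.0, 0.5, 2.0]

print(forecast_days(history, n_stories=8))
print(mmre([10.0, 20.0], [12.0, 15.0]))  # relative errors 0.2 and 0.25, mean about 0.225
```

Reporting a percentile rather than the mean of the simulated totals is one common way to express a forecast with a stated confidence level; the abstract does not say which summary the study used.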

2021

Benchmark of Encoders of Nominal Features for Regression

Authors
Seca, D; Moreira, JM;

Publication
Trends and Applications in Information Systems and Technologies - Volume 1, WorldCIST 2021, Terceira Island, Azores, Portugal, 30 March - 2 April, 2021.

Abstract
Mixed-type data is common in the real world. However, supervised learning algorithms such as support vector machines or neural networks can only process numerical features. One may choose to drop qualitative features, at the expense of possible loss of information. A better alternative is to encode them as new numerical features. Under the constraints of time, budget, and computational resources, we were motivated to search for a general-purpose encoder but found the existing benchmarks to be limited. We review these limitations and present an alternative. Our benchmark tests 16 encoding methods, on 15 regression datasets, using 7 distinct predictive models. The top general-purpose encoders were found to be Catboost, LeaveOneOut, and Target. © 2021, The Author(s), under exclusive license to Springer Nature Switzerland AG.
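
To make the notion of a target-based encoder concrete, here is a minimal sketch of leave-one-out encoding of a nominal feature for a regression target. It shows the general technique only and is not the benchmark's code or any library's implementation; the function name, the global-mean fallback for categories seen once, and the sample data are assumptions.

```python
from collections import defaultdict

def leave_one_out_encode(categories, targets):
    """Encode each row's category as the mean target of the other rows
    sharing that category. Categories seen only once fall back to the
    global target mean (an assumed convention for this sketch)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            # Exclude the row's own target to limit target leakage.
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            encoded.append(global_mean)
    return encoded

# Hypothetical nominal feature and regression target.
cities = ["porto", "lisboa", "porto", "braga", "lisboa", "porto"]
prices = [10.0, 20.0, 14.0, 30.0, 24.0, 12.0]
print(leave_one_out_encode(cities, prices))
```

Plain target encoding would use the category mean including the row itself; leaving the row out is what distinguishes the leave-one-out variant.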

2021

An Analysis of the State of the Art of Machine Learning for Risk Assessment in Software Projects (S)

Authors
Sousa, A; Faria, JP; Moreira, JM;

Publication
The 33rd International Conference on Software Engineering and Knowledge Engineering, SEKE 2021, KSIR Virtual Conference Center, USA, July 1 - July 10, 2021.

Abstract
Risk management is one of the ten knowledge areas discussed in the Project Management Body of Knowledge (PMBOK), which serves as a guide that should be followed to increase the chances of project success. The popularity of research regarding the application of risk management in software projects has been consistently growing in recent years, particularly with the application of machine learning techniques to help identify risk levels or risk factors of a project before the project development begins, with the intent of improving the likelihood of success of software projects. This paper provides an overview of various concepts related to risk and risk management in software projects, including traditional techniques used to identify and control risks in software projects, as well as machine learning techniques and methods which have been applied to provide better estimates and classification of the risk levels and risk factors that can be encountered during the development of a software project. The paper also presents an analysis of machine learning oriented risk management studies and experiments found in the literature as a way of identifying the type of inputs and outputs, as well as frequent algorithms used in this research area.

2021

Transportation Mode Detection from GPS data: A Data Science Benchmark study

Authors
Muhammad, AR; Aguiar, A; Mendes Moreira, J;

Publication
2021 IEEE Intelligent Transportation Systems Conference (ITSC)

Abstract
Understanding the distribution of people's transportation mode is a crucial facet of today's urban mobility for proper transportation planning. The penetration of smartphones combined with their sensing capability is an enabler for crowdsourcing large mobility data such as commuters' GPS records. In this paper, we leverage the GPS traces of commuters to infer five different transportation modes frequently used in urban areas including foot, bike, bus, car and metro. We compare three different approaches commonly reported in the literature for transportation mode detection from the family of machine learning algorithms (random forest, RF) and deep learning architectures (convolutional neural network, CNN, and ensemble of autoencoders, EAE). By splitting the dataset into train-test by the period of data collection, as well as the conventional 80-20 split, we evaluate the impact of several data pre-processing decisions on overall classifiers' performance. Our results show RF and CNN performing better upon evaluation on classification metrics such as the F1 score and the area under the Receiver Operating Characteristics (ROC) curve.
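
The two train-test strategies mentioned in this abstract can be sketched as follows. This is only an illustration under assumed names: the record layout (`day`, `mode`), the cut-off date, and both helper functions are hypothetical and do not come from the paper.

```python
import random
from datetime import date

def random_split(records, test_fraction=0.2, seed=0):
    """Conventional shuffled split (80-20 by default)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def period_split(records, cutoff):
    """Split by collection period: train on records before the cut-off
    day, test on records collected on or after it."""
    train = [r for r in records if r["day"] < cutoff]
    test = [r for r in records if r["day"] >= cutoff]
    return train, test

# Hypothetical labelled trips, one per day.
records = [
    {"day": date(2021, 1, d), "mode": "bus" if d % 2 else "foot"}
    for d in range(1, 11)
]

train_r, test_r = random_split(records)
train_p, test_p = period_split(records, cutoff=date(2021, 1, 9))
print(len(train_r), len(test_r), len(train_p), len(test_p))
```

A period-based split keeps all test data later in time than the training data, which a shuffled split does not guarantee.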

2021

A Data-Driven Simulator for Assessing Decision-Making in Soccer

Authors
Mendes-Neves, T; Mendes-Moreira, J; Rossetti, RJF;

Publication
Progress in Artificial Intelligence (EPIA 2021)

Abstract
Decision-making is one of the crucial factors in soccer (association football). The current focus is on analyzing data sets rather than posing what-if questions about the game. We propose simulation-based methods that allow us to answer these questions. To avoid simulating complex human physics and ball interactions, we use data to build machine learning models that form the basis of an event-based soccer simulator. This simulator is compatible with the OpenAI Gym API. We introduce tools that allow us to explore and gather insights about soccer, like (1) calculating the risk/reward ratios for sequences of actions, (2) manually defining playing criteria, and (3) discovering strategies through Reinforcement Learning.

2021

Applying Machine Learning to Risk Assessment in Software Projects

Authors
Sousa, A; Faria, JP; Mendes-Moreira, J; Gomes, D; Henriques, PC; Graca, R;

Publication
Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Pt II

Abstract
Risk management is one of the ten knowledge areas discussed in the Project Management Body of Knowledge (PMBOK), which serves as a guide that should be followed to increase the chances of project success. The popularity of research regarding the application of risk management in software projects has been consistently growing in recent years, especially with the application of machine learning techniques to help identify risk levels or risk factors of a project before its development begins, with the goal of improving the likelihood of success of these projects. This paper presents the results of the application of machine learning techniques for risk assessment in software projects. A Python application was developed and, using Scikit-learn, two machine learning models, trained using software project risk data shared by a partner company of this project, were created to predict risk impact and likelihood levels on a scale of 1 to 3. Different algorithms were tested to compare the results obtained by high performance but non-interpretable algorithms (e.g., Support Vector Machine) and the ones obtained by interpretable algorithms (e.g., Random Forest), whose performance tends to be lower than that of their non-interpretable counterparts. The results showed that Support Vector Machine and Naive Bayes were the best performing algorithms. Support Vector Machine had an accuracy of 69% in predicting impact levels, and Naive Bayes had an accuracy of 63% in predicting likelihood levels, but the results presented in other evaluation metrics (e.g., AUC, Precision) show the potential of the approach presented in this use case.
