2015
Authors
Zarmehri, MN; Soares, C;
Publication
2015 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)
Abstract
Traditionally, a single model is developed for a data mining task. As more data is being collected at a more detailed level, organizations are becoming more interested in having specific models for distinct parts of data (e.g. customer segments). From the business perspective, data can be divided naturally into different dimensions. Each of these dimensions is usually hierarchically organized (e.g. country, city, zip code), which means that, when developing a model for a given part of the problem (e.g. a zip code), the training data may be collected at different levels of this nested hierarchy (e.g. the same zip code, the city and the country it is located in). Selecting different levels of granularity may change the performance of the whole process, so the question is which level to use for a given part. We propose a metalearning model which recommends the level of granularity of the training data that is expected to yield the best-performing model. We apply decision tree and random forest algorithms for metalearning. At the base level, our experiment uses results obtained by outlier detection methods on the problem of detecting errors in foreign trade transactions. The results show that using metalearning helps to find the best level of granularity.
2015
Authors
Brito, PQ; Soares, C; Almeida, S; Monte, A; Byvoet, M;
Publication
ROBOTICS AND COMPUTER-INTEGRATED MANUFACTURING
Abstract
Data mining (DM) techniques have been used to solve marketing and manufacturing problems in the fashion industry. These approaches are expected to be particularly important for highly customized industries because the diversity of products sold makes it harder to find clear patterns of customer preferences. The goal of this project was to investigate two different data mining approaches for customer segmentation: clustering and subgroup discovery. The models obtained produced six market segments and 49 rules that allowed a better understanding of customer preferences in a highly customized fashion manufacturer/e-tailor. The scope and limitations of these clustering DM techniques will lead to further methodological refinements.
2015
Authors
Pinto, F; Soares, C; Mendes Moreira, J;
Publication
MULTIPLE CLASSIFIER SYSTEMS (MCS 2015)
Abstract
Ensemble learning algorithms often benefit from pruning strategies that reduce the number of individual models and improve performance. In this paper, we propose a metalearning method for pruning bagging ensembles. Our proposal differs from other pruning strategies in that it allows the ensemble to be pruned before the individual models are actually generated. The method consists of generating a set of characteristics from the bootstrap samples and relating them to the impact of the predictive models in multiple tested combinations. We executed experiments with bagged ensembles of 20 and 100 decision trees for 53 UCI classification datasets. Results show that our method is competitive with a state-of-the-art pruning technique and bagging, while using only 25% of the models.
2015
Authors
Rebelo, F; Soares, C; Rossetti, RJF;
Publication
2015 IEEE FIRST INTERNATIONAL SMART CITIES CONFERENCE (ISC2)
Abstract
In the early twenty-first century, social networks served only to let the world know our tastes, share our photos and share some thoughts. A decade later, these services are filled with an enormous amount of information. Now, industry and academia are exploring this information in order to extract implicit patterns. TwitterJam is a tool that analyses the contents of the social network Twitter to extract events related to road traffic. To reach this goal, we started by analysing tweets to identify those which really contain road traffic information. The second step was to gather official information to confirm the extracted information. We then correlated these two types of information (official and general) in order to verify the credibility of public tweets. The correlation was computed separately in two ways: the first concerns the number of tweets at a certain time of day and the second concerns the location of these tweets. Two hypotheses were also devised concerning these correlations. The results were not perfect but were reasonable enough. We also analysed tools suitable for the visualization of data to decide on the best strategy to follow. Finally, we developed a web application that shows the results, to help with their analysis.
2015
Authors
Zarmehri, MN; Soares, C;
Publication
Advances in Intelligent Data Analysis XIV
Abstract
Trip duration is an important metric for the management of taxi companies, as it affects operational efficiency, driver satisfaction and, above all, customer satisfaction. In particular, the ability to predict trip duration in advance can be very useful for allocating taxis to stands and finding the best route for trips. A data mining approach can be used to generate models for trip time prediction. In fact, given the amount of data available, different models can be generated for different taxis. Given the difference between the data collected by different taxis, the best model for each one can be obtained with different algorithms and/or parameter settings. However, finding the configuration that generates the best model for each taxi is computationally very expensive. In this paper, we propose the use of metalearning to address the problem of selecting the algorithm that generates the model with the most accurate predictions for each taxi. The approach is tested on data collected in the Drive-In project. Our results show that metalearning can help to select the algorithm with the best accuracy.
2015
Authors
Da Costa, JP; Roque, LAC; Soares, C;
Publication
STATISTICS & PROBABILITY LETTERS
Abstract
A new weighted rank correlation coefficient r(W2) has been introduced in Pinto da Costa (2011), following the coefficient r(W) introduced in Pinto Da Costa and Soares (2005); Soares et al. (2001); Pinto Da Costa et al. (2001). We give the expression of r(W2) in the case of ties and also present some simulations to study the behaviour of the coefficient.
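As a rough illustration of the kind of coefficient discussed in this abstract, the sketch below computes the earlier coefficient r(W) for two rankings without ties. The formula is written from memory of the 2005 paper and should be treated as an assumption, not as a quotation; the abstract itself gives no formula. The function name `weighted_rank_correlation` is hypothetical, and the sketch covers neither r(W2) nor the tied case that the paper addresses.

```python
def weighted_rank_correlation(r, q):
    """Sketch of r(W) for two rankings without ties (rank 1 = top).

    Assumed formula (not stated in the abstract):
        r_W = 1 - 6 * sum((R_i - Q_i)^2 * ((n - R_i + 1) + (n - Q_i + 1)))
                  / (n^4 + n^3 - n^2 - n)
    Disagreements on top-ranked items get a larger weight than
    disagreements on bottom-ranked items.
    """
    n = len(r)
    if n != len(q) or n < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    weighted_sq_diff = sum(
        (ri - qi) ** 2 * ((n - ri + 1) + (n - qi + 1))
        for ri, qi in zip(r, q)
    )
    return 1 - 6 * weighted_sq_diff / (n**4 + n**3 - n**2 - n)


reference = [1, 2, 3, 4]
top_swap = [2, 1, 3, 4]      # disagreement on the two top-ranked items
bottom_swap = [1, 2, 4, 3]   # disagreement on the two bottom-ranked items

print(weighted_rank_correlation(reference, top_swap))
print(weighted_rank_correlation(reference, bottom_swap))
```

Under this assumed formula, swapping the two top-ranked items lowers the coefficient more than swapping the two bottom-ranked ones, which is the behaviour that distinguishes a weighted rank coefficient from Spearman's.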