2010
Authors
Torgo, L; Soares, C;
Publication
Data Mining for Business Applications
Abstract
This paper describes a methodology for the application of hierarchical clustering methods to the task of outlier detection. The methodology is tested on the problem of cleaning Official Statistics data. The goal is to detect erroneous foreign trade transactions in data collected by the Portuguese Institute of Statistics (INE). These transactions are a minority, but they still have an important impact on the statistics produced by the institute. The detection of these rare errors is a manual, time-consuming task. This type of task is usually constrained by a limited amount of available resources. Our proposal addresses this issue by producing a ranking of outlyingness that allows a better management of the available resources by allocating them to the cases which are most different from the others and, thus, have a higher probability of being errors. Our method is based on the output of standard agglomerative hierarchical clustering algorithms, resulting in no significant additional computational costs. Our results show that it enables large savings by selecting a small subset of suspicious transactions for manual inspection, which, nevertheless, includes most of the erroneous transactions. In this study we compare our proposal to a state-of-the-art outlier ranking method (LOF) and show that our method achieves better results on this particular application. The results of our experiments are also competitive with previous results on the same data. Finally, the outcome of our experiments raises important questions about the method currently followed at INE for items with a small number of transactions.
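The central idea of this abstract, deriving an outlyingness ranking directly from the output of a standard agglomerative clustering, can be illustrated with a minimal sketch. This conveys only the general principle (cases that merge with the rest of the data late in the dendrogram are ranked as more suspicious), not the exact outlier factor defined in the paper; SciPy's `linkage` is assumed for the clustering step.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def outlyingness_ranking(X, method="average"):
    """Score each case by the dendrogram height at which it is first merged."""
    n = X.shape[0]
    Z = linkage(X, method=method)            # standard (n-1) x 4 merge table
    first_merge_height = np.zeros(n)
    for a, b, height, _ in Z:
        for cluster_id in (int(a), int(b)):
            if cluster_id < n:               # ids below n are original observations
                first_merge_height[cluster_id] = height
    # cases that only join the rest of the data at large heights come first
    ranking = np.argsort(-first_merge_height)
    return ranking, first_merge_height
```

Inspecting transactions in the order given by `ranking` concentrates a limited manual-inspection budget on the cases most dissimilar from the rest.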
1998
Authors
Gama, J; Torgo, L; Soares, C;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE-IBERAMIA 98
Abstract
Discretization of continuous attributes is an important task for certain types of machine learning algorithms. Bayesian approaches, for instance, require assumptions about data distributions. Decision trees, on the other hand, require sorting operations to deal with continuous attributes, which largely increase learning times. This paper presents a new method of discretization, whose main characteristic is that it takes into account interdependencies between attributes. Detecting interdependencies can be seen as discovering redundant attributes. This means that our method performs attribute selection as a side effect of the discretization. Empirical evaluation on five benchmark datasets from the UCI repository, using C4.5 and a naive Bayes classifier, shows a consistent reduction in the number of features without loss of generalization accuracy.
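As a rough illustration of how interdependency detection can double as attribute selection, the sketch below pairs equal-frequency binning with a normalised mutual information check that drops discretized attributes redundant with ones already kept. This is not the algorithm proposed in the paper; the binning scheme, the 0.95 threshold and scikit-learn's `normalized_mutual_info_score` are all assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def discretize(col, n_bins=5):
    """Equal-frequency binning of one continuous attribute."""
    edges = np.quantile(col, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(col, edges)

def discretize_and_prune(X, redundancy_threshold=0.95):
    """Discretize all attributes, then drop those redundant with already kept ones."""
    binned = np.column_stack([discretize(X[:, j]) for j in range(X.shape[1])])
    kept = []
    for j in range(binned.shape[1]):
        if all(normalized_mutual_info_score(binned[:, j], binned[:, k]) < redundancy_threshold
               for k in kept):
            kept.append(j)
    return binned[:, kept], kept
```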
2009
Authors
Torgo, L; Pereira, W; Soares, C;
Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS
Abstract
This paper describes a data mining approach to the problem of detecting erroneous foreign trade transactions in data collected by the Portuguese Institute of Statistics (INE). Erroneous transactions are a minority, but they still have an important impact on the official statistics produced by INE. Detecting these rare errors is a manual, time-consuming task, which is constrained by a limited amount of available resources (e.g. financial, human). These constraints are common to many other data analysis problems (e.g. fraud detection). Our previous work addresses this issue by producing a ranking of outlyingness that allows a better management of the available resources by allocating them to the most relevant cases. It is based on an adaptation of hierarchical clustering methods for outlier detection. However, the method cannot be applied to articles with a small number of transactions. In this paper, we complement the previous approach with standard statistical methods for outlier detection to handle articles with few transactions. Our experiments clearly show its advantages in terms of the criteria outlined by INE for considering any method applicable to this business problem. The generality of the approach remains to be tested in other problems which share the same constraints (e.g. fraud detection).
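For the low-volume articles mentioned in this abstract, a simple univariate rule can produce candidate errors. The sketch below uses Tukey's boxplot fences purely as an example of such a "standard statistical method"; the abstract does not state which test the paper actually adopts, nor the threshold on the number of transactions, so both are assumptions here.

```python
import numpy as np

def boxplot_outliers(values, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def inspect_article(prices, min_transactions=10):
    """Hypothetical routing: few transactions -> simple test, otherwise cluster-based ranking."""
    if len(prices) < min_transactions:
        return boxplot_outliers(prices)
    raise NotImplementedError("use the clustering-based outlier ranking instead")
```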
2008
Authors
Brito, P; Figueiredo, A; Pires, A; Ferreira, AS; Marcelo, C; Figueiredo, F; Sousa, F; Da Costa, JP; Pereira, J; Torgo, L; Castro, LCE; Silva, ME; Milheiro, P; Teles, P; Campos, P; Silva, PD;
Publication
COMPSTAT 2008 - Proceedings in Computational Statistics, 18th Symposium
Abstract
2008
Authors
Ribeiro, R; Torgo, L;
Publication
ECOLOGICAL MODELLING
Abstract
Algae blooms are ecological events associated with extremely high abundance values of certain algae. These rare events have a strong impact on the river's ecosystem. In this context, the prediction of such events is of special importance. This paper addresses the problems that result from evaluating and comparing models for the prediction of rare extreme values using standard evaluation statistics. In this context, we describe a new evaluation statistic that we have proposed in Torgo and Ribeiro [Torgo, L., Ribeiro, R., 2006. Predicting rare extreme values. In: Ng, W., Kitsuregawa, M., Li, J., Chang, K. (Eds.), Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'2006). Springer, pp. 816-820 (number 3918 in LNAI)], which can be used to identify the best models for predicting algae blooms. We apply this new statistic in a comparative study involving several models for predicting the abundance of different groups of phytoplankton in water samples collected in the Douro River, Porto, Portugal. Results show that the proposed statistic identifies a variant of a Support Vector Machine as outperforming the other models that were tried in the prediction of algae blooms.
2007
Authors
Torgo, L; Ribeiro, R;
Publication
Knowledge Discovery in Databases: PKDD 2007, Proceedings
Abstract
Cost-sensitive learning is a key technique for addressing many real-world data mining applications. Most existing research has focused on classification problems. In this paper we propose a framework for evaluating regression models in applications with non-uniform costs and benefits across the domain of the continuous target variable. Namely, we describe two metrics for assessing the costs and benefits of the predictions of any model given a set of test cases. We illustrate the use of our metrics in the context of a specific type of application where non-uniform costs are required: the prediction of rare extreme values of a continuous target variable. Our experiments provide clear evidence of the utility of the proposed framework for evaluating the merits of any model in this class of regression domains.
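A toy example of what "non-uniform costs and benefits across the domain of the target variable" can look like in practice is given below: prediction errors on rare extreme true values receive larger weights than errors on common values. This is only meant to convey the general idea and is not one of the two metrics defined in the paper; the relevance function, its thresholds and the weighting scheme are all assumptions.

```python
import numpy as np

def relevance(y, y_low, y_high, ramp):
    """Hypothetical piecewise-linear relevance: 0 inside [y_low, y_high], rising to 1 over 'ramp'."""
    y = np.asarray(y, dtype=float)
    r = np.zeros_like(y)
    r = np.where(y > y_high, np.clip((y - y_high) / ramp, 0.0, 1.0), r)
    r = np.where(y < y_low, np.clip((y_low - y) / ramp, 0.0, 1.0), r)
    return r

def relevance_weighted_mse(y_true, y_pred, y_low, y_high, ramp):
    """Squared errors on extreme true values are up-weighted by their relevance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = 1.0 + relevance(y_true, y_low, y_high, ramp)
    return np.average((y_true - y_pred) ** 2, weights=w)
```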