
Publications by Pavel Brazdil

2007

Does SVM really scale up to large bag of words feature spaces?

Authors
Colas, F; Paclik, P; Kok, JN; Brazdil, P;

Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS VII, PROCEEDINGS

Abstract
We are concerned with the problem of learning classification rules in text categorization, where many authors have presented Support Vector Machines (SVM) as the leading classification method. A number of studies, however, have repeatedly pointed out that in some situations SVM is outperformed by simpler methods such as naive Bayes or the nearest-neighbor rule. In this paper, we aim to develop a better understanding of SVM behaviour in typical text categorization problems represented by sparse bag-of-words feature spaces. We study in detail the performance and the number of support vectors when varying the training set size, the number of features and, unlike existing studies, also the SVM free parameter C, which is the upper bound on the Lagrange multipliers in the SVM dual. We show that SVM solutions with small C are high performers. However, most training documents are then bounded support vectors sharing the same weight C. Thus, SVM reduces to a nearest-mean classifier; this raises an interesting question about the merits of SVM in sparse bag-of-words feature spaces. Additionally, SVM suffers from performance deterioration for particular combinations of training set size and number of features.
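
The effect described above can be observed directly with a generic SVM implementation. Below is a minimal sketch (not the authors' experimental setup): it trains a linear SVM with a small C on a sparse bag-of-words task and counts how many support vectors are bounded, i.e. have a dual weight equal to C. The dataset, vectorizer settings, and the value C = 0.01 are illustrative assumptions.

    # Sketch: with small C, most training documents become bounded support
    # vectors that all share the same weight C. Dataset and C are illustrative.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import SVC

    # A binary text task in a sparse bag-of-words representation.
    data = fetch_20newsgroups(subset="train",
                              categories=["sci.med", "sci.space"])
    X = CountVectorizer().fit_transform(data.data)
    y = data.target

    svm = SVC(kernel="linear", C=0.01).fit(X, y)

    # dual_coef_ holds y_i * alpha_i; alpha_i == C marks a bounded SV.
    alphas = np.abs(svm.dual_coef_).ravel()
    bounded = np.isclose(alphas, svm.C).sum()
    print(f"support vectors: {svm.n_support_.sum()} of {X.shape[0]} documents")
    print(f"bounded (alpha == C): {bounded}")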

2007

Cost-sensitive decision trees applied to medical data

Authors
Freitas, A; Costa Pereira, A; Brazdil, P;

Publication
DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS

Abstract
Classification plays an important role in medicine, especially in medical diagnosis. Health applications often require classifiers that minimize the total cost, including misclassification costs and test costs. In fact, there are many reasons for considering costs in medicine, as diagnostic tests are not free and health budgets are limited. Our aim in this work was to define, implement and test a strategy for cost-sensitive learning. We defined an algorithm for decision tree induction that considers costs, including test costs, delayed costs and costs associated with risk. We then applied our strategy to train and evaluate cost-sensitive decision trees on medical data. The induced trees can be tested following several strategies, including group costs, common costs, and individual costs. Using the "risk" factor it is possible to penalize invasive or delayed tests and obtain patient-friendly decision trees.
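
As an illustration of the kind of criterion involved (a sketch under assumptions, not the authors' algorithm), the following scores a candidate diagnostic test by its risk-penalized test cost plus the expected misclassification cost of the split it induces. All costs and risk factors are invented for the example.

    # Sketch of a cost-sensitive split criterion: every patient pays the test
    # cost (inflated by a risk factor for invasive or delayed tests), and each
    # branch adds the cost of its residual misclassifications.
    from collections import Counter

    def misclassification_cost(labels, cost_fn, cost_fp):
        """Cost of labelling a branch with its majority class (0/1 labels)."""
        counts = Counter(labels)
        majority = max(counts, key=counts.get)
        # Minority examples are errors; the cost depends on the direction.
        return sum((cost_fn if majority == 0 else cost_fp)
                   for y in labels if y != majority)

    def test_score(labels_by_branch, test_cost, risk_factor,
                   cost_fn=500.0, cost_fp=100.0):
        n = sum(len(branch) for branch in labels_by_branch)
        total = n * test_cost * risk_factor  # every patient takes the test
        for branch in labels_by_branch:
            total += misclassification_cost(branch, cost_fn, cost_fp)
        return total

    # Example: an invasive test (risk_factor > 1) vs. a cheap benign one.
    split_a = [[0, 0, 0, 1], [1, 1, 1]]   # branches induced by test A
    split_b = [[0, 0, 1, 1], [0, 1, 1]]   # branches induced by test B
    print(test_score(split_a, test_cost=50.0, risk_factor=1.5))
    print(test_score(split_b, test_cost=5.0, risk_factor=1.0))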

2004

Improving progressive sampling via meta-learning on learning curves

Authors
Leite, R; Brazdil, P;

Publication
MACHINE LEARNING: ECML 2004, PROCEEDINGS

Abstract
This paper describes a method that can be seen as an improvement of standard progressive sampling. The standard method uses samples of data of increasing size until the accuracy of the learned concept cannot be further improved. The issue we address here is how to avoid using some of the samples in this progression. The paper presents a method for predicting the stopping point using a meta-learning approach. The method requires just four iterations of progressive sampling. The information gathered is used to identify the nearest learning curves, for which the sampling procedure was carried out fully. This in turn permits generating a prediction of the stopping point. Experimental evaluation shows that the method can lead to significant savings of time without significant losses of accuracy.
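
A minimal sketch of the idea, under assumptions the paper does not specify: accuracies from the first four samples of a geometric schedule are matched against stored full learning curves, and the stopping point is predicted from the plateaus of the nearest curves. The curve data, distance measure, and plateau rule are illustrative.

    # Sketch: predict the stopping point of progressive sampling from the
    # nearest fully sampled learning curves (all numbers are illustrative).
    import numpy as np

    sizes = [100, 200, 400, 800, 1600, 3200, 6400]   # sampling schedule

    # Full learning curves gathered on past datasets (the meta-knowledge).
    past_curves = np.array([
        [0.61, 0.68, 0.74, 0.78, 0.80, 0.81, 0.81],
        [0.55, 0.60, 0.70, 0.79, 0.84, 0.87, 0.88],
        [0.70, 0.75, 0.77, 0.78, 0.78, 0.78, 0.78],
    ])

    def predict_stop(partial, k=2, eps=0.01):
        """Average, over the k curves nearest to the 4-point partial curve,
        the first index where accuracy gains drop to eps or below."""
        d = np.linalg.norm(past_curves[:, :4] - partial, axis=1)
        neighbours = past_curves[np.argsort(d)[:k]]
        stops = []
        for curve in neighbours:
            gains = np.diff(curve)
            flat = np.where(gains <= eps)[0]
            stops.append(flat[0] + 1 if len(flat) else len(curve) - 1)
        return int(round(np.mean(stops)))

    # Accuracies from the four iterations run on the new dataset.
    partial = np.array([0.58, 0.63, 0.72, 0.78])
    i = predict_stop(partial)
    print(f"predicted stopping point: sample size {sizes[i]}")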

2003

Improving progressive sampling via meta-learning

Authors
Leite, R; Brazdil, P;

Publication
PROGRESS IN ARTIFICIAL INTELLIGENCE

Abstract
We present a method that can be seen as an improvement of the standard progressive sampling method. The method exploits information concerning the performance of a given algorithm on past datasets, which is used to generate predictions of the stopping point. Experimental evaluation shows that the method can lead to significant time savings without significant losses in accuracy.

2010

Meta-Learning - Concepts and Techniques

Authors
Vilalta, R; Carrier, CGG; Brazdil, P;

Publication
Data Mining and Knowledge Discovery Handbook, 2nd ed.


2006

On the behavior of SVM and some older algorithms in binary text classification tasks

Authors
Colas, F; Brazdil, P;

Publication
TEXT, SPEECH AND DIALOGUE, PROCEEDINGS

Abstract
Document classification has already been widely studied. Some studies compared feature selection techniques or feature space transformations, whereas others compared the performance of different algorithms. Recently, following the rising interest in the Support Vector Machine, various studies showed that SVM outperforms other classification algorithms. So should we simply stop bothering with other classification algorithms and always opt for SVM? We decided to investigate this issue and compared SVM to kNN and naive Bayes on binary classification tasks. An important issue is to compare optimized versions of these algorithms, which is what we have done. Our results show that all the classifiers achieved comparable performance on most problems. One surprising result is that SVM was not a clear winner, despite quite good overall performance. If suitable preprocessing is used with kNN, this algorithm continues to achieve very good results and scales up well with the number of documents, which is not the case for SVM. As for naive Bayes, it also achieved good performance.
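
Such a comparison can be sketched with generic tools (this is not the paper's protocol): each classifier is tuned by grid search and scored by cross-validation on one binary text task. The dataset, parameter grids, and preprocessing below are illustrative assumptions.

    # Sketch: compare tuned SVM, kNN, and naive Bayes on one binary text task.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import LinearSVC

    data = fetch_20newsgroups(
        subset="train",
        categories=["alt.atheism", "soc.religion.christian"])
    X = TfidfVectorizer(sublinear_tf=True).fit_transform(data.data)
    y = data.target

    # Each candidate is wrapped in a grid search so that optimized versions
    # of the algorithms are compared, as the abstract stipulates.
    candidates = {
        "SVM": GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}),
        "kNN": GridSearchCV(KNeighborsClassifier(metric="cosine"),
                            {"n_neighbors": [1, 5, 15, 30]}),
        "naive Bayes": GridSearchCV(MultinomialNB(),
                                    {"alpha": [0.01, 0.1, 1.0]}),
    }
    for name, clf in candidates.items():
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")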
