Publications by LIAAD

2007

Does SVM really scale up to large bag of words feature spaces?

Authors
Colas, F; Paclik, P; Kok, JN; Brazdil, P;

Publication
ADVANCES IN INTELLIGENT DATA ANALYSIS VII, PROCEEDINGS

Abstract
We are concerned with the problem of learning classification rules in text categorization, where many authors have presented Support Vector Machines (SVM) as the leading classification method. A number of studies, however, have repeatedly pointed out that in some situations SVM is outperformed by simpler methods such as naive Bayes or the nearest-neighbor rule. In this paper, we aim at developing a better understanding of SVM behaviour in typical text categorization problems represented by sparse bag of words feature spaces. We study in detail the performance and the number of support vectors when varying the training set size, the number of features and, unlike existing studies, also the SVM free parameter C, which is the upper bound on the Lagrange multipliers in the SVM dual. We show that SVM solutions with small C are high performers. However, most training documents are then bounded support vectors sharing the same weight C. Thus, SVM reduces to a nearest mean classifier; this raises an interesting question about the merits of SVM in sparse bag of words feature spaces. Additionally, SVM suffers from performance deterioration for particular combinations of training set size and number of features.
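
The reduction to a nearest mean classifier can be made concrete with a short derivation. This is only a sketch under two assumptions that the abstract does not state: a linear kernel, and the limiting case in which every training document (rather than most) is a bounded support vector with weight C. The symbols n_+, n_-, n, mu_+ and mu_- (class sizes and class means) are introduced here for illustration.

```latex
\begin{align*}
f(x) &= \sum_i \alpha_i y_i \langle x_i, x \rangle + b,
  \qquad 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0 \\
\alpha_i = C \ \forall i
  &\;\Rightarrow\; C\,(n_+ - n_-) = 0
  \;\Rightarrow\; n_+ = n_- = n \\
f(x) &= C \Big( \sum_{i:\,y_i=+1} \langle x_i, x \rangle
        - \sum_{i:\,y_i=-1} \langle x_i, x \rangle \Big) + b
      = C\,n\,\langle \mu_+ - \mu_-,\, x \rangle + b \\
f_{\mathrm{NM}}(x) &= \langle \mu_+ - \mu_-,\, x \rangle
      - \tfrac{1}{2}\big( \lVert \mu_+ \rVert^2 - \lVert \mu_- \rVert^2 \big)
\end{align*}
```

Under these assumptions the SVM decision function f and the nearest mean rule f_NM share the same normal direction, mu_+ - mu_-, and can differ only in the offset.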

2007

Cost-sensitive decision trees applied to medical data

Authors
Freitas, A; Costa Pereira, A; Brazdil, P;

Publication
DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS

Abstract
Classification plays an important role in medicine, especially for medical diagnosis. Health applications often require classifiers that minimize the total cost, including misclassification costs and test costs. In fact, there are many reasons for considering costs in medicine, as diagnostic tests are not free and health budgets are limited. Our aim with this work was to define, implement and test a strategy for cost-sensitive learning. We defined an algorithm for decision tree induction that considers costs, including test costs, delayed costs and costs associated with risk. Then we applied our strategy to train and evaluate cost-sensitive decision trees on medical data. The resulting trees can be tested under several strategies, including group costs, common costs, and individual costs. Using the "risk" factor, it is possible to penalize invasive or delayed tests and obtain patient-friendly decision trees.
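
The abstract does not give the splitting criterion, so the following is only an illustrative sketch of how test cost and a risk factor could enter attribute selection. It assumes a simple heuristic in which information gain is divided by test cost times risk; this is a common idea in cost-sensitive tree induction, not necessarily the authors' formula. The function names, the `risk` multiplier, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on one attribute."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def pick_test(rows, labels, test_cost, risk=None):
    """Pick the attribute with the best gain per unit of (cost * risk).

    Assumed heuristic: score = gain / (test_cost * risk), risk defaults to 1.
    """
    risk = risk or {}

    def score(attr):
        penalty = test_cost[attr] * risk.get(attr, 1.0)
        return information_gain(rows, labels, attr) / penalty

    return max(test_cost, key=score)


# Toy data: the biopsy separates the classes perfectly, the blood test only partly.
rows = [
    {"blood_test": "high", "biopsy": "pos"},
    {"blood_test": "high", "biopsy": "pos"},
    {"blood_test": "high", "biopsy": "neg"},
    {"blood_test": "low", "biopsy": "neg"},
    {"blood_test": "low", "biopsy": "neg"},
    {"blood_test": "low", "biopsy": "neg"},
]
labels = ["sick", "sick", "healthy", "healthy", "healthy", "healthy"]

cheap_first = pick_test(rows, labels, {"blood_test": 5, "biopsy": 100})
equal_cost = pick_test(rows, labels, {"blood_test": 1, "biopsy": 1})
risk_aware = pick_test(rows, labels, {"blood_test": 1, "biopsy": 1},
                       risk={"biopsy": 3.0})
```

With equal costs the more informative test is chosen; raising either the cost or the risk multiplier of the invasive test shifts the choice to the cheaper, less informative one.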

2007

A putative gene located at the MHC class I region around the D6S105 marker contributes to the setting of CD8+T-lymphocyte numbers in humans

Authors
Vieira, J; Cardoso, CS; Pinto, J; Patil, K; Brazdil, P; Cruz, E; Mascarenhas, C; Lacerda, R; Gartner, A; Almeida, S; Alves, H; Porto, G;

Publication
INTERNATIONAL JOURNAL OF IMMUNOGENETICS

Abstract
Significant associations between human leucocyte antigen (HLA)-A and -B alleles and CD8+ T-lymphocyte numbers have been reported in the literature in both healthy populations and in HFE-haemochromatosis patients. In order to address whether HLA alleles themselves or alleles at linked genes are responsible for these associations, several genetic markers at the MHC class I region were typed in a population of 147 apparently healthy unrelated subjects phenotypically characterized for their CD8+ and CD4+ T-lymphocyte numbers. By using a machine learning approach, a set of rules was generated that predicts CD8+ T-lymphocyte numbers on the basis of the D6S105 microsatellite alleles only. We demonstrate that the previously reported associations with HLA-A and -B alleles are due to the presence of common long haplotypes (up to 4 megabases) that recently increased in frequency due to positive selection and that encompass a region where a putative gene contributing to the setting of CD8+ T-lymphocyte numbers is located, in the neighbourhood of microsatellite locus D6S105, in the 6p21.3 region.

2007

Location of a putative gene contributing to the setting of CD8+T lymphocytes: A modifier of hereditary hemochromatosis expression?

Authors
Vieira, J; Cardoso, CS; Pinto, J; Patil, K; Brazdil, P; Cruz, E; Mascarenhas, C; Lacerda, R; Gartner, A; Almeida, S; Alves, H; Porto, G;

Publication
AMERICAN JOURNAL OF HEMATOLOGY

Abstract

2007

Strategic versus tactical nature of sales promotions

Authors
Brito, PQ; Hammond, K;

Publication
Journal of Marketing Communications

Abstract
Sales promotions (SP) are short-term instruments usually designed to yield an immediate sales effect. Previous research has suggested that SP can be seen as detrimental to a brand's consumer franchise/equity as, in the long term, SP deteriorates brand value. In this paper, we theoretically broaden the scope of SP research in relation to the following topics: the strategy concept, marketing strategy, the Integrated Marketing Communication (IMC) concept, the specific nature of each SP instrument and the underlying processes associated with consumer uptake of SP. We present findings that illustrate managers' perceptions of the positioning of SP instruments. We argue that the strategic nature of SP needs to be incorporated into marketers' research agendas.

2007

Distributed generative data mining

Authors
Ramos, R; Camacho, R;

Publication
ADVANCES IN DATA MINING: THEORETICAL ASPECTS AND APPLICATIONS, PROCEEDINGS

Abstract
A process of Knowledge Discovery in Databases (KDD) involving large amounts of data requires a considerable amount of computational power. The process may be run on dedicated and expensive machinery or, for some tasks, one can use distributed computing techniques on a network of affordable machines. In either approach it is usual for the user to specify the workflow of the sub-tasks composing the whole KDD process before execution starts. In this paper we propose a technique that we call Distributed Generative Data Mining. The generative feature of the technique is due to its capability of generating new sub-tasks of the Data Mining (DM) analysis process at execution time. The workflow of DM sub-tasks is, therefore, dynamic. To deploy the proposed technique we extended the Distributed Data Mining system HARVARD and adapted an Inductive Logic Programming system (IndLog) used in a Relational Data Mining task. As a proof of concept, the extended system was used to analyse an artificial dataset of a credit scoring problem with eighty million records.
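
The idea of a workflow whose sub-tasks are generated at execution time can be illustrated with a small work-queue sketch. It is a single-process toy, not the HARVARD or IndLog interface: `run_generative`, `split_or_count`, and the record-range task format are hypothetical names chosen for this example, and a distributed version would hand the queued tasks to worker machines instead of running them in a loop.

```python
from collections import deque


def run_generative(initial_tasks, execute):
    """Run tasks from a queue; each task may emit new tasks while running."""
    pending = deque(initial_tasks)
    results = []
    while pending:
        task = pending.popleft()
        result, new_tasks = execute(task)
        results.append(result)
        pending.extend(new_tasks)  # the workflow grows at execution time
    return results


def split_or_count(task, max_size=2):
    """Hypothetical sub-task over a half-open range of record indices.

    Small ranges are processed directly (here: counted); large ranges are
    not processed but generate two new sub-tasks covering each half.
    """
    lo, hi = task
    if hi - lo <= max_size:
        return hi - lo, []
    mid = (lo + hi) // 2
    return 0, [(lo, mid), (mid, hi)]


results = run_generative([(0, 8)], split_or_count)
```

Only the single task `(0, 8)` is specified up front; the remaining six tasks are created while the analysis runs, which is the sense in which the workflow is dynamic.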