Publications

Publications by LIAAD

2006

ODAC: Hierarchical Clustering of Time Series Data Streams

Authors
Rodrigues, PP; Gama, J; Pedroso, JP;

Publication
PROCEEDINGS OF THE SIXTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING

Abstract
This paper presents a time series whole clustering system that incrementally constructs a tree-like hierarchy of clusters, using a top-down strategy. The Online Divisive-Agglomerative Clustering (ODAC) system uses a correlation-based dissimilarity measure between time series over a data stream and possesses an agglomerative phase to enhance a dynamic behavior capable of concept drift detection. Main features include splitting and agglomerative criteria based on the diameters of existing clusters and supported by a. significance level. At each new example, only the leaves are updated, reducing computation of unneeded dissimilarities and speeding up the process every time the structure grows. Experimental results on artificial and real data suggest competitive performance on clustering time series and show that the system is equivalent to a batch divisive clustering on stationary time series, being also capable of dealing with concept drift. With this work, we assure the possibility and importance of hierarchical incremental time series whole clustering in the data stream paradigm, presenting a. valuable and usable option.

CloseRead Abstract

2006

A pipelined data-parallel algorithm for ILP

Authors
Fonseca, NA; Silva, F; Costa, VS; Camacho, R;

Publication
2005 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER)

Abstract
The amount of data collected and stored in databases is growing considerably for almost all areas of human activity. Processing this amount of data is very expensive, both humanly and computationally. This justifies the increased interest both on the automatic discovery of useful knowledge from databases, and on using parallel processing for this task. Multi Relational Data Mining (MRDM) techniques, such as Inductive Logic Programming (ILP), can learn rules from relational databases consisting of multiple tables. However current ILP systems are designed to run in main memory and can have long running times. We propose a pipelined data-parallel algorithm for ILP. The algorithm was implemented and evaluated on a commodity PC cluster with 8 processors. The results show that our algorithm yields excellent speedups, while preserving the quality of learning.

CloseRead Abstract

2006

April - An inductive logic programming system

Authors
Fonseca, NA; Silva, F; Camacho, R;

Publication
LOGICS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS

Abstract
Inductive Logic Programming (ILP) is a Machine Learning research field that has been quite successful in knowledge discovery in relational domains. ILP systems use a set of pre-classified examples (positive and negative) and prior knowledge to learn a theory in which positive examples succeed and the negative examples fail. In this paper we present a novel ILP system called April, capable of exploring several parallel strategies in distributed and shared memory machines.

CloseRead Abstract

2006

Comparison of SVM and some older classification algorithms in text classification tasks

Authors
Colas, F; Brazdil, P;

Publication
ARTIFICIAL INTELLIGENCE IN THEORY AND PRACTICE

Abstract
Document classification has already been widely studied. In fact, some studies compared feature selection techniques or feature space transformation whereas some others compared the performance of different algorithms. Recently, following the rising interest towards the Support Vector Machine, various studies showed that SVM outperforms other classification algorithms. So should we just not bother about other classification algorithms and opt always for SVM ? We have decided to investigate this issue and compared SVM to kNN and naive Bayes on binary classification tasks. An important issue is to compare optimized versions of these algorithms, which is what we have done. Our results show all the classifiers achieved comparable performance on most problems. One surprising result is that SVM was not a clear winner, despite quite good overall performance. If a suitable preprocessing is used with kNN, this algorithm continues to achieve very good results and scales up well with the number of documents, which is not the case for SVM. As for naive Bayes, it also achieved good performance.

CloseRead Abstract

2006

On the behavior of SVM and some older algorithms in binary text classification tasks

Authors
Colas, F; Brazdil, P;

Publication
TEXT, SPEECH AND DIALOGUE, PROCEEDINGS

Abstract
Document classification has already been widely studied. In fact, some studies compared feature selection techniques or feature space transformation whereas some others compared the performance of different algorithms. Recently, following the rising interest towards the Support Vector Machine, various studies showed that the SVM outperforms other classification algorithms. So should we just not bother about other classification algorithms and opt always for SVM? We have decided to investigate this issue and compared SVM to kNN and naive Bayes on binary classification tasks. An important issue is to compare optimized versions of these algorithms, which is what we have done. Our results show all the classifiers achieved comparable performance on most problems. One surprising result is that SVM was not a clear winner, despite quite good overall performance. If a suitable preprocessing is used with kNN, this algorithm continues to achieve very good results and scales up well with the number of documents, which is not the case for SVM. As for naive Bayes, it also achieved good performance.

CloseRead Abstract

2006

Quantitative pharmacophore models with inductive logic programming

Authors
Srinivasan, A; Page, D; Camacho, R; King, R;

Publication
MACHINE LEARNING

Abstract
Three-dimensional models, or pharmacophores, describing Euclidean constraints on the location on small molecules of functional groups (like hydrophobic groups, hydrogen acceptors and donors, etc.), are often used in drug design to describe the medicinal activity of potential drugs (or 'ligands'). This medicinal activity is produced by interaction of the functional groups on the ligand with a binding site on a target protein. In identifying structure-activity relations of this kind there are three principal issues: (1) It is often difficult to "align" the ligands in order to identify common structural properties that may be responsible for activity; (2) Ligands in solution can adopt different shapes (or 'conformations') arising from torsional rotations about bonds. The 3-D molecular substructure is typically sought on one or more low-energy conformers; and (3) Pharmacophore models must, ideally, predict medicinal activity on some quantitative scale. It has been shown that the logical representation adopted by Inductive Logic Programming (ILP) naturally resolves many of the difficulties associated with the alignment and multi-conformation issues. However, the predictions of models constructed by ILP have hitherto only been nominal, predicting medicinal activity to be present or absent. In this paper, we investigate the construction of two kinds of quantitative pharmacophoric models with ILP: (a) Models that predict the probability that a ligand is "active"; and (b) Models that predict the actual medicinal activity of a ligand. Quantitative predictions are obtained by the utilising the following statistical procedures as background knowledge: logistic regression and naive Bayes, for probability prediction; linear and kernel regression, for activity prediction. The multi-conformation issue and, more generally, the relational representation used by ILP results in some special difficulties in the use of any statistical procedure. We present the principal issues and some solutions. Specifically, using data on the inhibition of the protease Thermolysin, we demonstrate that it is possible for an ILP program to construct good quantitative structure-activity models. We also comment on the relationship of this work to other recent developments in statistical relational learning.

CloseRead Abstract