Publicacoes - INESC TEC

Publicações

Publicações por Nuno Fonseca

2008

Compile the Hypothesis Space: Do it Once, Use it Often

Autores
Fonseca, NA; Camacho, R; Rocha, R; Costa, VS;

Publicação
FUNDAMENTA INFORMATICAE

Abstract
Inductive Logic Programming (ILP) is a powerful and well-developed abstraction for multi-relational data mining techniques. Despite the considerable success of ILP, deployed ILP systems still have efficiency problems when applied to complex problems. In this paper we propose a novel technique that avoids the procedure of deducing each example to evaluate each constructed clause. The technique is based on the Mode Directed Inverse Entailment approach to ILP, where a bottom clause is generated for each example and the generated clauses are subsets of the literals of such bottom clause. We propose to store in a prefix-tree all clauses that can be generated from all bottom clauses together with some extra information. We show that this information is sufficient to estimate the number of examples that can be deduced from a clause and present an ILP algorithm that exploits this representation. We also present an extension of the algorithm where each prefix-tree is computed only once (compiled) per example. The evaluation of hypotheses requires only basic and efficient operations on trees. This proposal avoids re-computation of hypothesis' value in theory-level search, in cross-validation evaluation procedures and in parameter tuning. Both proposals are empirically evaluated on real applications and considerable speedups were observed.

FecharLer Abstract

2006

A pipelined data-parallel algorithm for ILP

Autores
Fonseca, NA; Silva, F; Costa, VS; Camacho, R;

Publicação
2005 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER)

Abstract
The amount of data collected and stored in databases is growing considerably for almost all areas of human activity. Processing this amount of data is very expensive, both humanly and computationally. This justifies the increased interest both on the automatic discovery of useful knowledge from databases, and on using parallel processing for this task. Multi Relational Data Mining (MRDM) techniques, such as Inductive Logic Programming (ILP), can learn rules from relational databases consisting of multiple tables. However current ILP systems are designed to run in main memory and can have long running times. We propose a pipelined data-parallel algorithm for ILP. The algorithm was implemented and evaluated on a commodity PC cluster with 8 processors. The results show that our algorithm yields excellent speedups, while preserving the quality of learning.

FecharLer Abstract

2006

April - An inductive logic programming system

Autores
Fonseca, NA; Silva, F; Camacho, R;

Publicação
LOGICS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS

Abstract
Inductive Logic Programming (ILP) is a Machine Learning research field that has been quite successful in knowledge discovery in relational domains. ILP systems use a set of pre-classified examples (positive and negative) and prior knowledge to learn a theory in which positive examples succeed and the negative examples fail. In this paper we present a novel ILP system called April, capable of exploring several parallel strategies in distributed and shared memory machines.

FecharLer Abstract

2009

BIORED - A Genetic Algorithm for Pattern Detection in Biosequences

Autores
Pereira, P; Silva, F; Fonseca, NA;

Publicação
2ND INTERNATIONAL WORKSHOP ON PRACTICAL APPLICATIONS OF COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (IWPACBB 2008)

Abstract
We present a new, efficient and scalable tool, named BIORED, for pattern discovery in proteomic and genomic sequences. It uses a genetic algorithm to find interesting patterns in the form of regular expressions, and a new efficient pattern matching procedure to count pattern occurrences. We studied the performance, scalability and usefulness of BIORED using several databases of biosequences. The results show that BIORED was successful in finding previously known patterns, thus an excellent indicator for its potential. BIORED is available for download under the GNU Public License at http://www.dcc.fc.up.pt/bi-ored/. An online demo is available at the same address.

FecharLer Abstract

2009

Improving the efficiency of inductive logic programming systems

Autores
Fonseca, NA; Costa, VS; Rocha, R; Camacho, R; Silva, F;

Publicação
SOFTWARE-PRACTICE & EXPERIENCE

Abstract
Inductive logic programming (ILP) is a sub-field of machine learning that provides an excellent framework for multi-relational data mining applications. The advantages of ILP have been successfully demonstrated in complex and relevant industrial and scientific problems. However, to produce valuable models, ILP systems often require long running times and large amounts of memory. In this paper we address fundamental issues that have direct impact on the efficiency of ILP systems. Namely, we discuss how improvements in the indexing mechanisms of an underlying logic programming system benefit ILP performance. Furthermore, we propose novel data structures to reduce memory requirements and we suggest a new lazy evaluation technique to search the hypothesis space more efficiently. These proposals have been implemented in the April ILP system and evaluated using several well-known data sets. The results observed show significant improvements in running time without compromising the accuracy of the models generated. Indeed, the combined techniques achieve several order of magnitudes speedup in some data sets. Moreover, memory requirements are reduced in nearly half of the data sets. Copyright (C) 2008 John Wiley & Sons, Ltd.

FecharLer Abstract

2009

Parallel ILP for distributed-memory architectures

Autores
Fonseca, NA; Srinivasan, A; Silva, F; Camacho, R;

Publicação
MACHINE LEARNING

Abstract
The growth of machine-generated relational databases, both in the sciences and in industry, is rapidly outpacing our ability to extract useful information from them by manual means. This has brought into focus machine learning techniques like Inductive Logic Programming (ILP) that are able to extract human-comprehensible models for complex relational data. The price to pay is that ILP techniques are not efficient: they can be seen as performing a form of discrete optimisation, which is known to be computationally hard; and the complexity is usually some super-linear function of the number of examples. While little can be done to alter the theoretical bounds on the worst-case complexity of ILP systems, some practical gains may follow from the use of multiple processors. In this paper we survey the state-of-the-art on parallel ILP. We implement several parallel algorithms and study their performance using some standard benchmarks. The principal findings of interest are these: (1) of the techniques investigated, one that simply constructs models in parallel on each processor using a subset of data and then combines the models into a single one, yields the best results; and (2) sequential (approximate) ILP algorithms based on randomized searches have lower execution times than (exact) parallel algorithms, without sacrificing the quality of the solutions found.

FecharLer Abstract