Publications

Publications by Vítor Santos Costa

2013

BigYAP: Exo-compilation meets UDI

Authors
Costa, VS; Vaz, D;

Publication
THEORY AND PRACTICE OF LOGIC PROGRAMMING

Abstract
The widespread availability of large data-sets poses both an opportunity and a challenge to logic programming. A first approach is to couple a relational database with logic programming, say, a Prolog system with MySQL. While this approach does pay off in cases where the data cannot reside in main memory, it is known to introduce substantial overheads. Ideally, we would like the Prolog system to deal with large data-sets in an efficient way both in terms of memory and of processing time. Just In Time Indexing (JITI) was mainly motivated by this challenge, and can work quite well in many application. Exo-compilation, designed to deal with large tables, is a next step that achieves very interesting results, reducing the memory footprint over two thirds. We show that combining exo-compilation with Just In Time Indexing can have significant advantages both in terms of memory usage and in terms of execution time. An alternative path that is relevant for many applications is User-Defined Indexing (UDI). This allows the use of specialized indexing for specific applications, say the spatial indexing crucial to any spatial system. The UDI sees indexing as pluggable modules, and can naturally be combined with Exo-compilation. We do so by using UDI with exo-data, and incorporating ideas from the UDI into high-performance indexers for specific tasks.

CloseRead Abstract

2015

Human pluripotent stem cell-derived neural constructs for predicting neural toxicity

Authors
Schwartz, MP; Hou, ZG; Propson, NE; Zhang, J; Engstrom, CJ; Costa, VS; Jiang, P; Nguyen, BK; Bolin, JM; Daly, W; Wang, Y; Stewart, R; Page, CD; Murphy, WL; Thomson, JA;

Publication
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA

Abstract
Human pluripotent stem cell-based in vitro models that reflect human physiology have the potential to reduce the number of drug failures in clinical trials and offer a cost-effective approach for assessing chemical safety. Here, human embryonic stem (ES) cell-derived neural progenitor cells, endothelial cells, mesenchymal stem cells, and microglia/macrophage precursors were combined on chemically defined polyethylene glycol hydrogels and cultured in serum-free medium to model cellular interactions within the developing brain. The precursors self-assembled into 3D neural constructs with diverse neuronal and glial populations, interconnected vascular networks, and ramified microglia. Replicate constructs were reproducible by RNA sequencing (RNA-Seq) and expressed neurogenesis, vasculature development, and microglia genes. Linear support vector machines were used to construct a predictive model from RNA-Seq data for 240 neural constructs treated with 34 toxic and 26 nontoxic chemicals. The predictive model was evaluated using two standard hold-out testing methods: a nearly unbiased leave-one-out cross-validation for the 60 training compounds and an unbiased blinded trial using a single hold-out set of 10 additional chemicals. The linear support vector produced an estimate for future data of 0.91 in the cross-validation experiment and correctly classified 9 of 10 chemicals in the blinded trial.

CloseRead Abstract

2014

PrologCheck - Property-Based Testing in Prolog

Authors
Amaral, C; Florido, M; Costa, VS;

Publication
FUNCTIONAL AND LOGIC PROGRAMMING, FLOPS 2014

Abstract
We present PrologCheck, an automatic tool for property-based testing of programs in the logic programming language Prolog with randomised test data generation. The tool is inspired by the well known QuickCheck, originally designed for the functional programming language Haskell. It includes features that deal with specific characteristics of Prolog such as its relational nature (as opposed to Haskell) and the absence of a strong type discipline. PrologCheck expressiveness stems from describing properties as Prolog goals. It enables the definition of custom test data generators for random testing tailored for the property to be tested. Further, it allows the use of a predicate specification language that supports types, modes and constraints on the number of successful computations. We evaluate our tool on a number of examples and apply it successfully to debug a Prolog library for AVL search trees.

CloseRead Abstract

2014

Support Vector Machines for Differential Prediction

Authors
Kuusisto, F; Costa, VS; Nassif, H; Burnside, ES; Page, D; Shavlik, JW;

Publication
ECML/PKDD (2)

Abstract
Machine learning is continually being applied to a growing set of fields, including the social sciences, business, and medicine. Some fields present problems that are not easily addressed using standard machine learning approaches and, in particular, there is growing interest in differential prediction. In this type of task we are interested in producing a classifier that specifically characterizes a subgroup of interest by maximizing the difference in predictive performance for some outcome between subgroups in a population. We discuss adapting maximum margin classifiers for differential prediction. We first introduce multiple approaches that do not affect the key properties of maximum margin classifiers, but which also do not directly attempt to optimize a standard measure of differential prediction. We next propose a model that directly optimizes a standard measure in this field, the uplift measure. We evaluate our models on real data from two medical applications and show excellent results. © 2014 Springer-Verlag.

CloseRead Abstract

2014

Relational machine learning for electronic health record-driven phenotyping

Authors
Peissig, PL; Costa, VS; Caldwell, MD; Rottscheit, C; Berg, RL; Mendonca, EA; Page, D;

Publication
JOURNAL OF BIOMEDICAL INFORMATICS

Abstract
Objective: Electronic health records (EHR) offer medical and pharmacogenomics research unprecedented opportunities to identify and classify patients at risk. EHRs are collections of highly inter-dependent records that include biological, anatomical, physiological, and behavioral observations. They comprise a patient's clinical phenome, where each patient has thousands of date-stamped records distributed across many relational tables. Development of EHR computer-based phenotyping algorithms require time and medical insight from clinical experts, who most often can only review a small patient subset representative of the total EHR records, to identify phenotype features. In this research we evaluate whether relational machine learning (ML) using inductive logic programming (ILP) can contribute to addressing these issues as a viable approach for EHR-based phenotyping. Methods: Two relational learning ILP approaches and three well-known WEKA (Waikato Environment for Knowledge Analysis) implementations of non-relational approaches (PART, J48, and JRIP) were used to develop models for nine phenotypes. International Classification of Diseases, Ninth Revision (ICD-9) coded EHR data were used to select training cohorts for the development of each phenotypic model. Accuracy, precision, recall, F-Measure, and Area Under the Receiver Operating Characteristic (AUROC) curve statistics were measured for each phenotypic model based on independent manually verified test cohorts. A two-sided binomial distribution test (sign test) compared the five ML approaches across phenotypes for statistical significance. Results: We developed an approach to automatically label training examples using ICD-9 diagnosis codes for the ML approaches being evaluated. Nine phenotypic models for each ML approach were evaluated, resulting in better overall model performance in AUROC using ILP when compared to PART (p = 0.039), J48 (p = 0.003) and JRIP (p = 0.003). Discussion: ILP has the potential to improve phenotyping by independently delivering clinically expert interpretable rules for phenotype definitions, or intuitive phenotypes to assist experts. Conclusion: Relational learning using ILP offers a viable approach to EHR-driven phenotyping.

CloseRead Abstract

2013

Score As You Lift (SAYL): A Statistical Relational Learning Approach to Uplift Modeling

Authors
Nassif, H; Kuusisto, F; Burnside, ES; Page, D; Shavlik, JW; Costa, VS;

Publication
ECML/PKDD (3)

Abstract
We introduce Score As You Lift (SAYL), a novel Statistical Relational Learning (SRL) algorithm, and apply it to an important task in the diagnosis of breast cancer. SAYL combines SRL with the marketing concept of uplift modeling, uses the area under the uplift curve to direct clause construction and final theory evaluation, integrates rule learning and probability assignment, and conditions the addition of each new theory rule to existing ones. Breast cancer, the most common type of cancer among women, is categorized into two subtypes: an earlier in situ stage where cancer cells are still confined, and a subsequent invasive stage. Currently older women with in situ cancer are treated to prevent cancer progression, regardless of the fact that treatment may generate undesirable side-effects, and the woman may die of other causes. Younger women tend to have more aggressive cancers, while older women tend to have more indolent tumors. Therefore older women whose in situ tumors show significant dissimilarity with in situ cancer in younger women are less likely to progress, and can thus be considered for watchful waiting. Motivated by this important problem, this work makes two main contributions. First, we present the first multi-relational uplift modeling system, and introduce, implement and evaluate a novel method to guide search in an SRL framework. Second, we compare our algorithm to previous approaches, and demonstrate that the system can indeed obtain differential rules of interest to an expert on real data, while significantly improving the data uplift. © 2013 Springer-Verlag.

CloseRead Abstract