Publications

Publications by Vítor Santos Costa

2017

Markov logic networks for adverse drug event extraction from text

Authors
Natarajan, S; Bangera, V; Khot, T; Picado, J; Wazalwar, A; Costa, VS; Page, D; Caldwell, M;

Publication
KNOWLEDGE AND INFORMATION SYSTEMS

Abstract
Adverse drug events (ADEs) are a major concern and point of emphasis for the medical profession, government, and society. A diverse set of techniques from epidemiology, statistics, and computer science are being proposed and studied for ADE discovery from observational health data (e.g., EHR and claims data), social network data (e.g., Google and Twitter posts), and other information sources. Methodologies are needed for evaluating, quantitatively measuring and comparing the ability of these various approaches to accurately discover ADEs. This work is motivated by the observation that text sources such as the Medline/Medinfo library provide a wealth of information on human health. Unfortunately, ADEs often result from unexpected interactions, and the connection between conditions and drugs is not explicit in these sources. Thus, in this work, we address the question of whether we can quantitatively estimate relationships between drugs and conditions from the medical literature. This paper proposes and studies a state-of-the-art NLP-based extraction of ADEs from text.

CloseRead Abstract

2016

Predicting Wildfires Propositional and Relational Spatio-Temporal Pre-processing Approaches

Authors
Oliveira, M; Torgo, L; Costa, VS;

Publication
DISCOVERY SCIENCE, (DS 2016)

Abstract
We present and evaluate two different methods for building spatio-temporal features: a propositional method and a method based on propositionalisation of relational clauses. Our motivating application, a regression problem, requires the prediction of the fraction of each Portuguese parish burnt yearly by wildfires - a problem with a strong socio-economic and environmental impact in the country. We evaluate and compare how these methods perform individually and combined together. We successfully use under-sampling to deal with the high skew in the data set. We find that combining the approaches significantly improves the similar results obtained by each method individually.

CloseRead Abstract

2016

Relational Learning with GPUs: Accelerating Rule Coverage

Authors
Martínez Angeles, CA; Wu, HC; Dutra, I; Costa, VS; Buenabad Chávez, J;

Publication
INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING

Abstract
Relational learning algorithms mine complex databases for interesting patterns. Usually, the search space of patterns grows very quickly with the increase in data size, making it impractical to solve important problems. In this work we present the design of a relational learning system, that takes advantage of graphics processing units (GPUs) to perform the most time consuming function of the learner, rule coverage. To evaluate performance, we use four applications: a widely used relational learning benchmark for predicting carcinogenesis in rodents, an application in chemo-informatics, an application in opinion mining, and an application in mining health record data. We compare results using a single and multiple CPUs in a multicore host and using the GPU version. Results show that the GPU version of the learner is up to eight times faster than the best CPU version.

CloseRead Abstract

2014

Towards using Probabilities and Logic to Model Regulatory Networks

Authors
Gonçalves, A; Ong, I; Lewis, JA; Costa, VS;

Publication
2014 IEEE 27TH INTERNATIONAL SYMPOSIUM ON COMPUTER-BASED MEDICAL SYSTEMS (CBMS)

Abstract
Transcriptional regulation plays an important role in every cellular decision. Unfortunately, understanding the dynamics that govern how a cell will respond to diverse environmental cues is difficult using intuition alone. We introduce logic-based regulation models based on state-of-the-art work on statistical relational learning, and validate our approach by using it to analyze time-series gene expression data of the Hog1 pathway. Our results show that plausible regulatory networks can be learned from time series gene expression data using a probabilistic logical model. Hence, network hypotheses can be generated from existing gene expression data for use by experimental biologists.

CloseRead Abstract

2014

Couillard: Parallel programming via coarse-grained Data-flow Compilation

Authors
Marzulo, LAJ; Alves, TAO; França, FMG; Costa, VS;

Publication
PARALLEL COMPUTING

Abstract
Data-flow is a natural approach to parallelism. However, describing dependencies and control between fine-grained data-flow tasks can be complex and present unwanted overheads. TALM (TALM is an Architecture and Language for Multi-threading) introduces a user-defined coarse-grained parallel data-flow model, where programmers identify code blocks, called super-instructions, to be run in parallel and connect them in a data-flow graph. TALM has been implemented as a hybrid Von Neumann/data-flow execution system: the Trebuchet. We have observed that TALM's usefulness largely depends on how programmers specify and connect super-instructions. Thus, we present Couillard, a full compiler that creates, based on an annotated C-program, a data-flow graph and C-code corresponding to each super-instruction. We show that our toolchain allows one to benefit from data-flow execution and explore sophisticated parallel programming techniques, with small effort. To evaluate our system we have executed a set of real applications on a large multi-core machine. Comparison with popular parallel programming methods shows competitive speedups, while providing an easier parallel programing approach. More specifically, for an application that follows the wavefront method, running with big inputs, Trebuchet achieved up to 4.7% speedup over Intel (R) TBB novel flow-graph approach and up to 44% over OpenMP.

CloseRead Abstract

2017

Pharmacovigilance via Baseline Regularization with Large-Scale Longitudinal Observational Data

Authors
Kuang, Z; Peissig, PL; Costa, VS; Maclin, R; Page, D;

Publication
KDD

Abstract
Several prominent public health incidents [29] that occurred at the beginning of this century due to adverse drug events (ADEs) have raised international awareness of governments and industries about pharmacovigilance (PhV) [6, 7], the science and activities to monitor and prevent adverse events caused by pharmaceutical products after they are introduced to the market. A major data source for PhV is large-scale longitudinal observational databases (LODs) [6] such as electronic health records (EHRs) and medical insurance claim databases. Inspired by the Multiple Self-Controlled Case Series (MSCCS) model [27], arguably the leading method for ADE discovery from LODs, we propose baseline regularization, a regularized generalized linear model that leverages the diverse health profiles available in LODs across different individuals at different times. We apply the proposed method as well as MSCCS to the Marshfield Clinic EHR. Experimental results suggest that incorporatingthe heterogeneity among different patients and different times help to improve the performance in identifying benchmark ADEs from the Observational Medical Outcomes Partnership ground truth [26].

CloseRead Abstract