Publications - INESC TEC

Publications

Publications by LIAAD

2023

Privacy-Preserving Machine Learning on Apache Spark

Authors
Brito, CV; Ferreira, PG; Portela, BL; Oliveira, RC; Paulo, JT;

Publication
IEEE ACCESS

Abstract
The adoption of third-party machine learning (ML) cloud services is highly dependent on the security guarantees and the performance penalty they incur on workloads for model training and inference. This paper explores security/performance trade-offs for the distributed Apache Spark framework and its ML library. Concretely, we build upon a key insight: in specific deployment settings, one can reveal carefully chosen non-sensitive operations (e.g. statistical calculations). This allows us to considerably improve the performance of privacy-preserving solutions without exposing the protocol to pervasive ML attacks. In more detail, we propose Soteria, a system for distributed privacy-preserving ML that leverages Trusted Execution Environments (e.g. Intel SGX) to run computations over sensitive information in isolated containers (enclaves). Unlike previous work, where all ML-related computation is performed at trusted enclaves, we introduce a hybrid scheme, combining computation done inside and outside these enclaves. The experimental evaluation validates that our approach reduces the runtime of ML algorithms by up to 41% when compared to previous related work. Our protocol is accompanied by a security proof and a discussion regarding resilience against a wide spectrum of ML attacks.

2023

Soteria: Preserving Privacy in Distributed Machine Learning

Authors
Brito, C; Ferreira, P; Portela, B; Oliveira, R; Paulo, J;

Publication
38TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2023

Abstract
We propose Soteria, a system for distributed privacy-preserving Machine Learning (ML) that leverages Trusted Execution Environments (e.g. Intel SGX) to run code in isolated containers (enclaves). Unlike previous work, where all ML-related computation is performed at trusted enclaves, we introduce a hybrid scheme, combining computation done inside and outside these enclaves. The conducted experimental evaluation validates that our approach reduces the runtime of ML algorithms by up to 41%, when compared to previous related work. Our protocol is accompanied by a security proof, as well as a discussion regarding resilience against a wide spectrum of ML attacks.

2023

A systematic evaluation of deep learning methods for the prediction of drug synergy in cancer

Authors
Baptista, D; Ferreira, PG; Rocha, M;

Publication
PLOS COMPUTATIONAL BIOLOGY

Abstract
Author summary: Cancer therapies often fail because tumor cells become resistant to treatment. One way to overcome resistance is by treating patients with a combination of two or more drugs. Some combinations may be more effective than the individual drug effects would suggest, a phenomenon called drug synergy. Computational drug synergy prediction methods can help to identify new, clinically relevant drug combinations. In this study, we developed several deep learning models for drug synergy prediction. We examined the effect of using different types of deep learning architectures, and different ways of representing drugs and cancer cell lines. We explored the use of biological prior knowledge to select relevant cell line features, and also tested data-driven feature reduction methods. We tested both precomputed drug features and deep learning methods that can directly learn features from raw representations of molecules. We also evaluated whether including genomic features, in addition to gene expression data, improves the predictive performance of the models. Through these experiments, we were able to identify strategies that will help guide the development of new deep learning models for drug synergy prediction in the future.

One of the main obstacles to the successful treatment of cancer is the phenomenon of drug resistance. A common strategy to overcome resistance is the use of combination therapies. However, the space of possibilities is huge and efficient search strategies are required. Machine Learning (ML) can be a useful tool for the discovery of novel, clinically relevant anti-cancer drug combinations. In particular, deep learning (DL) has become a popular choice for modeling drug combination effects. Here, we set out to examine the impact of different methodological choices on the performance of multimodal DL-based drug synergy prediction methods, including the use of different input data types, preprocessing steps and model architectures.

Focusing on the NCI ALMANAC dataset, we found that feature selection based on prior biological knowledge has a positive impact: limiting gene expression data to cancer or drug response-specific genes improved performance. Drug features appeared to be more predictive of drug response, with a 41% increase in coefficient of determination (R²) and 26% increase in Spearman correlation relative to a baseline model that used only cell line and drug identifiers. Molecular fingerprint-based drug representations performed slightly better than learned representations: ECFP4 fingerprints increased R² by 5.3% and Spearman correlation by 2.8% relative to the best learned representations. In general, fully connected feature-encoding subnetworks outperformed other architectures. DL outperformed other ML methods by more than 35% (R²) and 14% (Spearman). Additionally, an ensemble combining the top DL and ML models improved performance by about 6.5% (R²) and 4% (Spearman). Using a state-of-the-art interpretability method, we showed that DL models can learn to associate drug and cell line features with drug response in a biologically meaningful way. The strategies explored in this study will help to improve the development of computational methods for the rational design of effective drug combinations for cancer therapy.

2023

Mapeamento do Perfil das Mulheres Brasileiras em Processamento de Linguagem Natural [Mapping the Profile of Brazilian Women in Natural Language Processing]

Authors
Helena Caseli; Evelin Amorim; Elisa Terumi Rubel Schneider; Leidiana Iza Andrade Freitas; Jéssica Rodrigues; Maria das Graças V. Nunes;

Publication
Anais do XVII Women in Information Technology (WIT 2023)

Abstract
Understanding the profile of Brazilian women working in Natural Language Processing (NLP) is an important step toward developing policies and programs that aim to increase inclusion and diversity in this field. This is the first study carried out in Brazil with this purpose. From data collected through a public survey, Lattes and LinkedIn, we observed that the typical profile is a background in computing or linguistics, working in companies or universities, but with little ethnic diversity and an apparent difficulty in reconciling professional life and motherhood. Looking more specifically at the group “Brasileiras em PLN”, we found a substantial capacity for publication and supervision, but still little collaboration among our members.

2023

Time Series of Counts under Censoring: A Bayesian Approach

Authors
Silva, I; Silva, ME; Pereira, I; McCabe, B;

Publication
ENTROPY

Abstract
Censored data are frequently found in diverse fields including environmental monitoring, medicine, economics and social sciences. Censoring occurs when observations are available only for a restricted range, e.g., due to a detection limit. Ignoring censoring produces biased estimates and unreliable statistical inference. The aim of this work is to contribute to the modelling of time series of counts under censoring using convolution closed infinitely divisible (CCID) models. The emphasis is on estimation and inference problems, using Bayesian approaches with Approximate Bayesian Computation (ABC) and Gibbs sampler with Data Augmentation (GDA) algorithms.

2023

Automatic characterisation of Dansgaard-Oeschger events in palaeoclimate ice records

Authors
Barbosa, S; Silva, ME; Dias, N; Rousseau, D;

Publication

Abstract
Greenland ice core records display abrupt transitions, designated as Dansgaard-Oeschger (DO) events, characterised by episodes of rapid warming (typically decades) followed by a slower cooling. The identification of abrupt transitions is hindered by the typically low resolution and small size of palaeoclimate records, and by their significant temporal variability. Furthermore, the amplitude and duration of the DO events vary substantially along the last glacial period, which further hinders the objective identification of abrupt transitions from ice core records. Automatic, purely data-driven methods have the potential to foster the identification of abrupt transitions in palaeoclimate time series in an objective way, complementing the traditional identification of transitions by visual inspection of the time series.

In this study we apply an algorithmic time series method, the Matrix Profile approach, to the analysis of the NGRIP Greenland ice core record, focusing on:
- the ability of the method to retrieve abrupt transitions automatically, by comparing the anomalies identified by the matrix profile method with the expert-based identification of DO events;
- the characterisation of DO events, by classifying DO events in terms of shape and identifying events with similar warming/cooling temporal patterns.

The results for the NGRIP time series show that the matrix profile approach struggles to retrieve all the abrupt transitions that are identified by experts as DO events, the main limitation arising from the diversity in length of DO events and the method’s dependence on fixed-size sub-sequences within the time series. However, the matrix profile method is able to characterise the similarity of shape patterns between DO events in an objective and consistent way.
