Publications

Publications by João Tiago Paulo

2016

Efficient Deduplication in a Distributed Primary Storage Infrastructure

Authors
Paulo, J; Pereira, J;

Publication
ACM TRANSACTIONS ON STORAGE

Abstract
A large amount of duplicate data typically exists across volumes of virtual machines in cloud computing infrastructures. Deduplication allows reclaiming these duplicates while improving the cost-effectiveness of large-scale multitenant infrastructures. However, traditional archival and backup deduplication systems impose prohibitive storage overhead for virtual machines hosting latency-sensitive applications. Primary deduplication systems reduce such penalty but rely on special cluster filesystems, centralized components, or restrictive workload assumptions. Also, some of these systems reduce storage overhead by confining deduplication to off-peak periods that may be scarce in a cloud environment. We present DEDIS, a dependable and fully decentralized system that performs cluster-wide off-line deduplication of virtual machines' primary volumes. DEDIS works on top of any unsophisticated storage backend, centralized or distributed, as long as it exports a basic shared block device interface. Also, DEDIS does not rely on data locality assumptions and incorporates novel optimizations for reducing deduplication overhead and increasing its reliability. The evaluation of an open-source prototype shows that minimal I/O overhead is achievable even when deduplication and intensive storage I/O are executed simultaneously. Also, our design scales out and allows collocating DEDIS components and virtual machines in the same servers, thus, sparing the need of additional hardware.

CloseRead Abstract

2016

On the Cost of Safe Storage for Public Clouds: an Experimental Evaluation

Authors
Burihabwa, D; Pontes, R; Felber, P; Maia, F; Mercier, H; Oliveira, R; Paulo, J; Schiavoni, V;

Publication
PROCEEDINGS OF 2016 IEEE 35TH SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS (SRDS)

Abstract
Cloud-based storage services such as Dropbox, Google Drive and OneDrive are increasingly popular for storing enterprise data, and they have already become the de facto choice for cloud-based backup of hundreds of millions of regular users. Drawn by the wide range of services they provide, no upfront costs and 24/7 availability across all personal devices, customers are well-aware of the benefits that these solutions can bring. However, most users tend to forget-or worse ignore-some of the main drawbacks of such cloud-based services, namely in terms of privacy. Data entrusted to these providers can be leaked by hackers, disclosed upon request from a governmental agency's subpoena, or even accessed directly by the storage providers (e.g., for commercial benefits). While there exist solutions to prevent or alleviate these problems, they typically require direct intervention from the clients, like encrypting their data before storing it, and reduce the benefits provided such as easily sharing data between users. This practical experience report studies a wide range of security mechanisms that can be used atop standard cloud-based storage services. We present the details of our evaluation testbed and discuss the design choices that have driven its implementation. We evaluate several state-of-the-art techniques with varying security guarantees responding to user-assigned security and privacy criteria. Our results reveal the various trade-offs of the different techniques by means of representative workloads on top of industry-grade storage services.

CloseRead Abstract

2016

Resource Usage Prediction in Distributed Key-Value Datastores

Authors
Cruz, F; Maia, F; Matos, M; Oliveira, R; Paulo, J; Pereira, J; Vilaca, R;

Publication
DISTRIBUTED APPLICATIONS AND INTEROPERABLE SYSTEMS, DAIS 2016

Abstract
In order to attain the promises of the Cloud Computing paradigm, systems need to be able to transparently adapt to environment changes. Such behavior benefits from the ability to predict those changes in order to handle them seamlessly. In this paper, we present a mechanism to accurately predict the resource usage of distributed key-value datastores. Our mechanism requires offline training but, in contrast with other approaches, it is sufficient to run it only once per hardware configuration and subsequently use it for online prediction of database performance under any circumstance. The mechanism accurately estimates the database resource usage for any request distribution with an average accuracy of 94 %, only by knowing two parameters: (i) cache hit ratio; and (ii) incoming throughput. Both input values can be observed in real time or synthesized for request allocation decisions. This novel approach is sufficiently simple and generic, while simultaneously being suitable for other practical applications.

CloseRead Abstract

2016

SafeRegions: Performance evaluation of multi-party protocols on HBase

Authors
Pontes, R; Maia, F; Paulo, J; Vilaca, R;

Publication
2016 IEEE 35TH INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS WORKSHOPS (SRDSW)

Abstract
On-line applications and services are now a critical part of our everyday life. Using these services typically requires us to trust our personal or company's information to a large number of third-party entities. These entities enforce several security measures to avoid unauthorized accesses but data is still stored on common database systems that are designed without data privacy concerns in mind. As a result, data is vulnerable against anyone with direct access to the database, which may be external attackers, malicious insiders, spies or even subpoenas. Building strong data privacy mechanisms on top of common database systems is possible but has a significant impact on the system's resources, computational capabilities and performance. Notably, the amount of useful computation that may be done over strongly encrypted data is close to none, which defeats the purpose of offloading computation to third-party services. In this paper, we propose to shift the need to trust in the honesty and security of service providers to simply trust that they will not collude. This is reasonable as cloud providers, being competitors, do not share data among themselves. We focus on NoSQL databases and present SafeRegions, a novel prototype of a distributed and secure NoSQL database that is built on top of HBase and that guarantees strong data privacy while still providing most of HBase's query capabilities. SafeRegions relies on secret sharing and multiparty computation techniques to provide a NoSQL database built on top of multiple, non-colluding service providers that appear as a single one to the user. Strikingly, service providers, individually, cannot disclose any of the user's data but, together, are able to offer data storage and processing capabilities. Additionally, we evaluate SafeRegions exposing performance trade-offs imposed by security mechanisms and provide useful insights for future research on performance optimization.

CloseRead Abstract

2014

Distributed Exact Deduplication for Primary Storage Infrastructures

Authors
Paulo, J; Pereira, J;

Publication
DISTRIBUTED APPLICATIONS AND INTEROPERABLE SYSTEMS (DAIS 2014)

Abstract
Deduplication of primary storage volumes in a cloud computing environment is increasingly desirable, as the resulting space savings contribute to the cost effectiveness of a large scale multi-tenant infrastructure. However, traditional archival and backup deduplication systems impose prohibitive overhead for latency-sensitive applications deployed at these infrastructures while, current primary deduplication systems rely on special cluster filesystems, centralized components, or restrictive workload assumptions. We present DEDIS, a fully-distributed and dependable system that performs exact and cluster-wide background deduplication of primary storage. DEDIS does not depend on data locality and works on top of any unsophisticated storage backend, centralized or distributed, that exports a basic shared block device interface. The evaluation of an open-source prototype shows that DEDIS scales out and adds negligible overhead

CloseRead Abstract

2013

MeT: workload aware elasticity for NoSQL

Authors
Cruz, F; Maia, F; Matos, M; Oliveira, R; Paulo, J; Pereira, J; Vilaça, R;

Publication
Eighth Eurosys Conference 2013, EuroSys '13, Prague, Czech Republic, April 14-17, 2013

Abstract
NoSQL databases manage the bulk of data produced by modern Web applications such as social networks. This stems from their ability to partition and spread data to all available nodes, allowing NoSQL systems to scale. Unfortunately, current solutions' scale out is oblivious to the underlying data access patterns, resulting in both highly skewed load across nodes and suboptimal node configurations. In this paper, we first show that judicious placement of HBase partitions taking into account data access patterns can improve overall throughput by 35%. Next, we go beyond current state of the art elastic systems limited to uninformed replica addition and removal by: i) reconfiguring existing replicas according to access patterns and ii) adding replicas specifically configured to the expected access pattern. MeT is a prototype for a Cloud-enabled framework that can be used alone or in conjunction with OpenStack for the automatic and heterogeneous reconfiguration of a HBase deployment. Our evaluation, conducted using the YCSB workload generator and a TPC-C workload, shows that MeT is able to i) autonomously achieve the performance of a manual configured cluster and ii) quickly reconfigure the cluster according to unpredicted workload changes. © 2013 ACM.

CloseRead Abstract