
Publications by João Tiago Paulo

2017

HTAPBench: Hybrid Transactional and Analytical Processing Benchmark

Authors
Coelho, F; Paulo, J; Vilaça, R; Pereira, JO; Oliveira, R

Publication
Proceedings of the 8th ACM/SPEC International Conference on Performance Engineering (ICPE 2017), L'Aquila, Italy, April 22-26, 2017

Abstract
The increasing demand for real-time analytics requires the fusion of Transactional (OLTP) and Analytical (OLAP) systems, eschewing ETL processes and introducing a plethora of proposals for the so-called Hybrid Analytical and Transactional Processing (HTAP) systems. Unfortunately, current benchmarking approaches are not able to comprehensively produce a unified metric from the assessment of an HTAP system. The evaluation of both engine types is done separately, leading to the use of disjoint sets of benchmarks such as TPC-C or TPC-H. In this paper we propose a new benchmark, HTAPBench, providing a unified metric for HTAP systems geared toward the execution of constantly increasing OLAP requests limited by an admissible impact on OLTP performance. To achieve this, a load balancer within HTAPBench regulates the coexistence of OLTP and OLAP workloads, proposing a method for the generation of both new data and requests, so that OLAP requests over freshly modified data are comparable across runs. We demonstrate the merit of our approach by validating it with different types of systems: OLTP, OLAP and HTAP; showing that the benchmark is able to highlight the differences between them, while producing queries with comparable complexity across experiments with negligible variability.
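
The load-balancing mechanism described above can be pictured as a simple feedback loop. The sketch below is illustrative only: the function names, the admissible-impact threshold, and the simulated workload are hypothetical, not taken from HTAPBench.

    # Hypothetical sketch of HTAP admission control: keep adding OLAP workers
    # while measured OLTP throughput stays within an admissible impact bound.
    def regulate_olap_load(measure_oltp_tps, launch_olap_worker,
                           baseline_tps, admissible_impact=0.05, max_workers=64):
        floor = baseline_tps * (1.0 - admissible_impact)  # lowest tolerable OLTP rate
        workers = 0
        while workers < max_workers:
            if measure_oltp_tps() < floor:  # OLTP degraded past the target impact
                break
            launch_olap_worker()
            workers += 1
        return workers  # OLAP clients sustained at the admissible OLTP impact

    # Toy usage with a simulated system whose OLTP rate drops per OLAP worker.
    state = {"workers": 0}
    print(regulate_olap_load(
        measure_oltp_tps=lambda: 1000 - 15 * state["workers"],
        launch_olap_worker=lambda: state.__setitem__("workers", state["workers"] + 1),
        baseline_tps=1000))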

2019

TRUSTFS: An SGX-enabled Stackable File System Framework

Authors
Esteves, T; Macedo, R; Faria, A; Portela, B; Paulo, J; Pereira, J; Harnik, D

Publication
2019 38th International Symposium on Reliable Distributed Systems Workshops (SRDSW 2019)

Abstract
Data confidentiality in cloud services is commonly ensured by encrypting information before uploading it. However, this approach limits the use of content-aware functionalities, such as deduplication and compression. Although this issue has been addressed individually for some of these functionalities, no unified framework for building secure storage systems exists that can leverage such operations over encrypted data. We present TRUSTFS, a programmable and modular stackable file system framework for implementing secure content-aware storage functionalities over hardware-assisted trusted execution environments. This framework extends the original SAFEFS architecture to provide the isolated execution guarantees of Intel SGX. We demonstrate its usability by implementing an SGX-enabled stackable file system prototype, and a preliminary evaluation shows that it incurs a reasonable performance overhead when compared to conventional storage systems. Finally, we highlight open research challenges that must be further pursued in order for TRUSTFS to be fully adequate for building production-ready secure storage solutions.
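
As a rough illustration of the stackable design: each layer transforms data on the write path and reverses the transformation on the read path. The layer names and toy cipher below are hypothetical, and the sketch does not model the SGX enclave in which TRUSTFS isolates its sensitive layers.

    # Hypothetical sketch of a stackable file system pipeline.
    import zlib

    class CompressionLayer:
        def on_write(self, data: bytes) -> bytes: return zlib.compress(data)
        def on_read(self, data: bytes) -> bytes: return zlib.decompress(data)

    class ToyCipherLayer:
        """Stand-in for the SGX-isolated encryption layer (NOT real crypto)."""
        def __init__(self, key: int): self.key = key
        def on_write(self, data: bytes) -> bytes: return bytes(b ^ self.key for b in data)
        def on_read(self, data: bytes) -> bytes: return bytes(b ^ self.key for b in data)

    class Stack:
        def __init__(self, layers): self.layers = layers
        def write(self, data: bytes) -> bytes:
            for layer in self.layers:            # top-down on writes
                data = layer.on_write(data)
            return data
        def read(self, data: bytes) -> bytes:
            for layer in reversed(self.layers):  # bottom-up on reads
                data = layer.on_read(data)
            return data

    # Content-aware layers (compression) sit above encryption, so they still
    # see plaintext -- the property TRUSTFS preserves by computing inside SGX.
    stack = Stack([CompressionLayer(), ToyCipherLayer(key=0x5A)])
    assert stack.read(stack.write(b"genome data")) == b"genome data"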

2019

A Case for Dynamically Programmable Storage Background Tasks

Authors
Macedo, R; Faria, A; Paulo, J; Pereira, J

Publication
2019 38th International Symposium on Reliable Distributed Systems Workshops (SRDSW 2019)

Abstract
Modern storage infrastructures feature long and complicated I/O paths composed of several layers, each employing its own optimizations to serve varied applications with fluctuating requirements. However, as these layers do not have global infrastructure visibility, they are unable to optimally tune their behavior to achieve maximum performance. Background storage tasks, in particular, can rapidly overload shared resources, but are executed either periodically or whenever a certain threshold is hit, regardless of the overall load on the system. In this paper, we argue that to achieve optimal holistic performance, these tasks should be dynamically programmable and handled by a controller with global visibility. To support this argument, we evaluate the impact on performance of compaction and checkpointing in the context of HBase and PostgreSQL. We find that these tasks can increase 99th-percentile latencies by 955.2% and 61.9%, respectively. We also identify future research directions to achieve programmable background tasks.
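
A toy sketch of the kind of controller the authors argue for, which gates background work on globally observed tail latency instead of fixed per-layer thresholds. The SLO value, priorities, and callbacks are all hypothetical.

    # Hypothetical sketch: a controller with global visibility releases queued
    # background tasks (compactions, checkpoints) only when foreground tail
    # latency has headroom, instead of firing them on fixed local thresholds.
    import heapq, itertools

    class BackgroundTaskController:
        def __init__(self, p99_slo_ms: float):
            self.p99_slo_ms = p99_slo_ms
            self._seq = itertools.count()   # tie-breaker for equal priorities
            self._pending = []

        def submit(self, priority: int, task):
            heapq.heappush(self._pending, (priority, next(self._seq), task))

        def tick(self, observed_p99_ms: float):
            # Drain tasks while the system is comfortably under its latency SLO.
            while self._pending and observed_p99_ms < self.p99_slo_ms:
                _, _, task = heapq.heappop(self._pending)
                task()

    ctrl = BackgroundTaskController(p99_slo_ms=50.0)
    ctrl.submit(priority=1, task=lambda: print("run compaction"))
    ctrl.tick(observed_p99_ms=80.0)   # overloaded: compaction stays queued
    ctrl.tick(observed_p99_ms=20.0)   # idle period: compaction runs now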

2020

A Survey and Classification of Software-Defined Storage Systems

Authors
Macedo, R; Paulo, J; Pereira, J; Bessani, A

Publication
ACM Computing Surveys

Abstract
The exponential growth of digital information is imposing increasing scale and efficiency demands on modern storage infrastructures. As infrastructure complexity increases, so does the difficulty in ensuring quality of service, maintainability, and resource fairness, raising unprecedented performance, scalability, and programmability challenges. Software-Defined Storage (SDS) addresses these challenges by cleanly disentangling control and data flows, easing management, and improving control functionality of conventional storage systems. Despite its momentum in the research community, many aspects of the paradigm are still unclear, undefined, and unexplored, leading to misunderstandings that hamper the research and development of novel SDS technologies. In this article, we present an in-depth study of SDS systems, providing a thorough description and categorization of each plane of functionality. Further, we propose a taxonomy and classification of existing SDS solutions according to different criteria. Finally, we provide key insights about the paradigm and discuss potential future research directions for the field.
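
To make the control/data plane separation concrete, here is an illustrative toy, not drawn from any surveyed system: a control plane with global visibility installs a policy that data-plane stages then enforce on the I/O path.

    # Illustrative sketch of the SDS split: the control plane installs rules,
    # and data-plane stages enforce them on every I/O request.
    class DataPlaneStage:
        def __init__(self, name: str):
            self.name = name
            self.rate_limit_mbps = None     # policy slot set by the control plane

        def apply_rule(self, rule: dict):
            self.rate_limit_mbps = rule.get("rate_limit_mbps")

        def handle_io(self, size_mb: float) -> str:
            if self.rate_limit_mbps is not None and size_mb > self.rate_limit_mbps:
                return f"{self.name}: throttled"
            return f"{self.name}: forwarded"

    class ControlPlane:
        def __init__(self, stages): self.stages = stages
        def install(self, rule: dict):
            for stage in self.stages:       # global visibility: one policy, all stages
                stage.apply_rule(rule)

    cache, replica = DataPlaneStage("cache"), DataPlaneStage("replica")
    ControlPlane([cache, replica]).install({"rate_limit_mbps": 100})
    print(cache.handle_io(250))   # -> "cache: throttled"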

2021

GenoDedup: Similarity-Based Deduplication and Delta-Encoding for Genome Sequencing Data

Authors
Cogo, V; Paulo, J; Bessani, A

Publication
IEEE Transactions on Computers

Abstract
The vast datasets produced in human genomics must be efficiently stored, transferred, and processed while prioritizing storage space and restore performance. Balancing these two properties becomes challenging when resorting to traditional data compression techniques. In fact, specialized algorithms for compressing sequencing data favor the former, while large genome repositories widely resort to generic compressors (e.g., GZIP) to benefit from the latter. Notably, human beings have approximately 99.9 percent of DNA sequence similarity, vouching for an excellent opportunity for deduplication and its assets: leveraging inter-file similarity and achieving higher read performance. However, identity-based deduplication fails to provide a satisfactory reduction in the storage requirements of genomes. In this article, we balance space savings and restore performance by proposing GenoDedup, the first method that integrates efficient similarity-based deduplication and specialized delta-encoding for genome sequencing data. Our solution currently achieves 67.8 percent of the reduction gains of SPRING (i.e., the best specialized tool in this metric) and restores data 1.62x faster than SeqDB (i.e., the fastest competitor). Additionally, GenoDedup restores data 9.96x faster than SPRING and compresses files 2.05x more than SeqDB.
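
A toy sketch of the combination the paper proposes: pick the most similar already-stored chunk as a base and keep only a delta against it. GenoDedup's actual similarity sketches and delta encoders are specialized for sequencing data; everything below is illustrative and assumes fixed-size chunks.

    # Illustrative similarity-based dedup with delta-encoding over equal-size
    # chunks; real systems use compact similarity sketches, not full scans.
    def similarity(a: bytes, b: bytes) -> float:
        return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

    def delta_encode(base: bytes, chunk: bytes):
        # Keep only the positions where the chunk differs from its base.
        return [(i, c) for i, (b, c) in enumerate(zip(base, chunk)) if b != c]

    def delta_decode(base: bytes, delta) -> bytes:
        out = bytearray(base)
        for i, c in delta:
            out[i] = c
        return bytes(out)

    def dedup(chunks, threshold=0.8):
        bases, encoded = [], []
        for chunk in chunks:
            scored = [(similarity(b, chunk), i) for i, b in enumerate(bases)]
            best_score, best_idx = max(scored, default=(0.0, -1))
            if best_score >= threshold:   # similar enough: store a delta only
                encoded.append(("delta", best_idx, delta_encode(bases[best_idx], chunk)))
            else:                         # too different: store as a new base
                bases.append(chunk)
                encoded.append(("base", len(bases) - 1, chunk))
        return bases, encoded

    bases, enc = dedup([b"ACGTACGT", b"ACGTACGA", b"TTTTTTTT"])
    assert delta_decode(bases[0], enc[1][2]) == b"ACGTACGA"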

2020

On the Trade-Offs of Combining Multiple Secure Processing Primitives for Data Analytics

Authors
Carvalho, H; Cruz, D; Pontes, R; Paulo, J; Oliveira, R

Publication
Distributed Applications and Interoperable Systems - 20th IFIP WG 6.1 International Conference, DAIS 2020, Held as Part of the 15th International Federated Conference on Distributed Computing Techniques, DisCoTec 2020, Valletta, Malta, June 15-19, 2020, Proceedings

Abstract
Cloud Computing services for data analytics are increasingly being sought by companies to extract value from large quantities of information. However, processing data from individuals and companies in third-party infrastructures raises several privacy concerns. To this end, different secure analytics techniques and systems have recently emerged. These initial proposals leverage specific cryptographic primitives, lacking generality, and thus have their application restricted to particular application scenarios. In this work, we contribute to this thriving body of knowledge by combining two complementary approaches to process sensitive data. We present SafeSpark, a secure data analytics framework that enables the combination of different cryptographic processing techniques with hardware-based protected environments for privacy-preserving data storage and processing. SafeSpark is modular and extensible, thus adapting to data analytics applications with different performance, security, and functionality requirements. We have implemented a SafeSpark prototype based on Spark SQL and Intel SGX hardware. It has been evaluated with the TPC-DS benchmark under three scenarios using different cryptographic primitives and secure hardware configurations. Each scenario provides a particular set of security guarantees and yields a distinct performance impact, with overheads ranging from as low as 10% to an acceptable 300% when compared to an insecure vanilla deployment of Apache Spark.
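
As an illustration of what combining primitives means in practice, the toy below tags each column with a protection scheme, as a SafeSpark-style deployment might. The schemes, schema, and encoders are hypothetical placeholders, not real cryptography or SafeSpark's API.

    # Illustrative per-column protection plan mixing primitives: deterministic
    # encoding keeps equality predicates, order-preserving encoding keeps
    # range predicates, and enclave-marked columns defer to trusted hardware.
    import hashlib

    def det(value: str) -> str:
        # Deterministic digest stands in for DET encryption (supports equality).
        return hashlib.sha256(value.encode()).hexdigest()[:16]

    def ope(value: int) -> int:
        # Toy monotonic mapping stands in for OPE (supports range queries but
        # leaks order -- the kind of trade-off SafeSpark lets users choose).
        return value * 7 + 3

    SCHEMA = {"customer_id": det, "age": ope, "notes": "enclave"}

    def protect(row: dict) -> dict:
        out = {}
        for col, val in row.items():
            scheme = SCHEMA[col]
            out[col] = scheme(val) if callable(scheme) else ("enclave", val)
        return out

    print(protect({"customer_id": "alice", "age": 42, "notes": "prefers email"}))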
