
Publications by Nuno Miguel Paulino

2014

Trace-Based Reconfigurable Acceleration with Data Cache and External Memory Support

Authors
Paulino, N; Ferreira, JC; Cardoso, JMP;

Publication
2014 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS (ISPA)

Abstract
This paper presents a binary acceleration approach based on extending a General Purpose Processor (GPP) with a Reconfigurable Processing Unit (RPU), both sharing an external data memory. In this approach, repeating sequences of GPP instructions are migrated to the RPU. The RPU resources are selected and organized off-line using execution trace information. The RPU core is composed of Functional Units (FUs) that correspond to single CPU instructions. The FUs are arranged in stages of mutually independent operations. The RPU can enable several stages in tandem, depending on the data dependencies. External data memory accesses are handled by a configurable dual-port cache. A prototype implementation of the architecture on a Spartan-6 FPGA was validated with 12 benchmarks and achieved an overall geometric mean speedup of 1.91x.
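
To make the notion of "stages of mutually independent operations" concrete, the following small C sketch assigns the operations of a toy instruction trace to stages by dependence depth, so that operations sharing a stage never depend on one another. It is only an illustration of the scheduling idea described in the abstract, not the paper's actual tool flow; the trace, operation names, and dependence edges are invented for the example.

    /* Illustrative sketch (not the paper's tool): group the operations of a
     * short trace into stages so that ops within one stage are mutually
     * independent, mirroring the RPU organization described above. */
    #include <stdio.h>

    #define N_OPS 6

    /* deps[i][j] != 0 means operation i consumes the result of operation j.
     * The trace is in program order, so ops only depend on earlier ops. */
    static const int deps[N_OPS][N_OPS] = {
        /* op0: load a    */ {0},
        /* op1: load b    */ {0},
        /* op2: add r0,r1 */ {1, 1, 0},
        /* op3: mul r0,2  */ {1, 0, 0},
        /* op4: sub r2,r3 */ {0, 0, 1, 1, 0},
        /* op5: store r4  */ {0, 0, 0, 0, 1, 0},
    };

    static const char *names[N_OPS] = {
        "load a", "load b", "add", "mul", "sub", "store"
    };

    int main(void) {
        int stage[N_OPS];

        /* ASAP leveling: an op's stage is one past the deepest stage of its
         * producers, so ops placed in the same stage never depend on each
         * other and can execute as one row of FUs. */
        for (int i = 0; i < N_OPS; ++i) {
            int s = 0;
            for (int j = 0; j < i; ++j)
                if (deps[i][j] && stage[j] + 1 > s)
                    s = stage[j] + 1;
            stage[i] = s;
        }

        for (int i = 0; i < N_OPS; ++i)
            printf("stage %d: %s\n", stage[i], names[i]);
        return 0;
    }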

2017

On Coding Techniques for Targeting FPGAs via OpenCL

Authors
Paulino, N; Reis, L; Cardoso, JMP;

Publication
Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing, ParCo 2017, 12-15 September 2017, Bologna, Italy

Abstract
Software developers have always found it difficult to adopt Field-Programmable Gate Arrays (FPGAs) as computing platforms. Recent advances in HLS tools aim to ease the mapping of computations to FPGAs by abstracting the hardware design effort via a standard OpenCL interface and execution model. However, OpenCL is a low-level programming language and requires that developers master the target architecture in order to achieve efficient results. Thus, efforts addressing the generation of OpenCL from high-level languages are of paramount importance to increase design productivity and to help software developers. Existing approaches bridge this gap by translating MATLAB/Octave code into C, or similar languages, in order to improve performance by efficiently compiling for the target hardware. One example is the MATISSE source-to-source compiler, which translates MATLAB code into standard-compliant C and/or OpenCL code. In this paper, we analyse the viability of combining both flows so that sections of MATLAB code can be translated to specialized hardware with a small amount of effort, and test a few code optimizations and their effect on performance. We present preliminary results on execution times, resource usage, and power consumption for two OpenCL kernels generated by MATISSE, and for manual optimizations of each kernel based on different coding techniques.

2019

Dynamic Partial Reconfiguration of Customized Single-Row Accelerators

Authors
Paulino, NMC; Ferreira, JC; Cardoso, JMP;

Publication
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Abstract
The use of specialized accelerator circuits is a feasible solution to address performance and energy issues in embedded systems. This paper extends a previous field-programmable gate array-based approach that automatically generates pipelined customized loop accelerators (CLAs) from runtime instruction traces. Despite efficient acceleration, the approach suffered from high area and resource requirements when offloading a large number of kernels from the target application. This paper addresses the issue by enhancing the CLA with dynamic partial reconfiguration (DPR) support. Each kernel to accelerate is implemented as a variant of a reconfigurable area of the CLA, which hosts all functional units and configuration memory. Evaluation of the proposed system is performed on a Virtex-7 device. We show, for a set of 21 kernels, that when comparing two CLAs capable of accelerating the same subset of kernels, the one which benefits from DPR can be up to 4.3x smaller. Resorting to DPR allows for the implementation of CLAs which support numerous kernels without a significant decrease in operating frequency, and it does not affect the initiation intervals (IIs) at which kernels are scheduled. Finally, the area required by a CLA instance can be further reduced by increasing the IIs of the scheduled kernels.

2020

Improving Performance and Energy Consumption in Embedded Systems via Binary Acceleration: A Survey

Authors
Paulino, N; Ferreira, JC; Cardoso, JMP;

Publication
ACM COMPUTING SURVEYS

Abstract
The breakdown of Dennard scaling has resulted in a decade-long stall of the maximum operating clock frequencies of processors. To mitigate this issue, computing shifted to multi-core devices. This introduced the need for programming flows and tools that facilitate the expression of workload parallelism at high abstraction levels. However, not all workloads are easily parallelizable, and the minor improvements to processor cores have not significantly increased single-threaded performance. Simultaneously, the Instruction Level Parallelism available in applications remains considerably underexplored. This article reviews notable approaches that focus on exploiting this potential parallelism via automatic generation of specialized hardware from binary code. Although research on this topic spans more than 20 years, automatic acceleration of software via translation to hardware has gained new importance with the recent trend toward reconfigurable heterogeneous platforms. We characterize this kind of binary acceleration approach and the accelerator architectures on which it relies. We summarize notable state-of-the-art approaches individually and present a taxonomy and comparison. Performance gains from 2.6x to 5.6x are reported, mostly considering bare-metal embedded applications, along with power consumption reductions between 1.3x and 3.9x. We believe the methodologies and results achievable by automatic hardware generation approaches are promising in the context of emergent reconfigurable devices.

2020

Optimizing OpenCL Code for Performance on FPGA: k-Means Case Study With Integer Data Sets

Authors
Paulino, N; Ferreira, JC; Cardoso, JMP;

Publication
IEEE ACCESS

Abstract
High Level Synthesis (HLS) tools targeting Field Programmable Gate Arrays (FPGAs) aim to provide a method for programming these devices via high-level abstractions. Initially, HLS support for FPGAs focused on compiling C/C++ to hardware circuits. This raised the issue of determining which programming practices result in the best performing circuits. Recently, to further increase the applicability of HLS approaches, renewed effort was placed on support for HLS of OpenCL code for FPGAs, raising the same issues of coding practices and performance portability. This paper explores the performance of OpenCL code compiled for FPGAs under different coding techniques. We evaluate the use of task kernels versus NDRange kernels, data vectorization, the use of on-chip local memories, and data transfer optimizations that exploit burst access inference. We present this exploration via a case study of the k-means algorithm, and produce a total of 10 OpenCL implementations of the kernel. To determine the effects of different data set characteristics, and to measure the gains from specialization based on the number of attributes, we generated a total of 12 integer data sets. The data sets vary in the number of instances, number of attributes (i.e., features), and number of clusters. We also vary the number of processing cores, and present the resulting resource requirements and operating frequencies. Finally, we execute the same OpenCL code on a 4 GHz Intel i7-6700K CPU, showing that the FPGA achieves speedups of up to 1.54x in four cases, and energy savings of up to 80% in all cases.
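
As an illustration of the kind of coding technique discussed above, the following OpenCL C sketch shows an NDRange k-means assignment kernel that stages the centroids in on-chip local memory before the distance computation. It is a generic example and not one of the ten kernel variants evaluated in the paper; all identifiers, argument layouts, and dimensions are assumptions made for the illustration.

    /* Illustrative OpenCL C sketch of one coding technique mentioned above
     * (staging data in on-chip local memory). Not one of the paper's kernel
     * variants; all names and layouts are made up for the example. */
    __kernel void assign_clusters(__global const int *points,     /* n * dims */
                                  __global const int *centroids,  /* k * dims */
                                  __global int       *labels,     /* n */
                                  const int n, const int k, const int dims,
                                  __local int *local_centroids)   /* k * dims */
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        /* Cooperatively copy the centroids into local (on-chip) memory so the
         * inner loop reads fast block RAM instead of external memory. */
        for (int i = lid; i < k * dims; i += lsz)
            local_centroids[i] = centroids[i];
        barrier(CLK_LOCAL_MEM_FENCE);

        if (gid >= n)
            return;

        /* Assign the point to the nearest centroid (squared distance). */
        int best = 0;
        long best_dist = LONG_MAX;
        for (int c = 0; c < k; ++c) {
            long dist = 0;
            for (int d = 0; d < dims; ++d) {
                long diff = (long)points[gid * dims + d]
                          - (long)local_centroids[c * dims + d];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        labels[gid] = best;
    }

On the host side, the __local buffer would be sized to k * dims integers via clSetKernelArg with a NULL pointer; a single-work-item (task) variant of the same loop nest is the other style the abstract contrasts.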

2020

Executing ARMv8 Loop Traces on Reconfigurable Accelerator via Binary Translation Framework

Authors
Paulino, N; Ferreira, JC; Bispo, J; Cardoso, JMP;

Publication
2020 30TH INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL)

Abstract
Performance and power efficiency in edge and embedded systems can benefit from specialized hardware. To avoid the effort of manual hardware design, we explore the generation of accelerator circuits from binary instruction traces for several Instruction Set Architectures.
