Cookies Policy
The website need some cookies and similar means to function. If you permit us, we will use those means to collect data on your visits for aggregated statistics to improve our service. Find out More
Accept Reject
  • Menu
About
Download Photo HD

About

I received my Master's Degree from FEUP (Faculdade de Engenharia da Universidade do Porto), in Electrical and Computer Engineering. My thesis was titled Generation of Reconfigurable Circuits from Machine Code, a work which continued throughout my PhD in Electrical and Computer Engineering, also at FEUP, and in association with INESC-TEC.

Having completed my PhD thesis, Generation of Custom Run-time Reconfigurable Hardware for Transparent Binary Acceleration, I am now a post-doc researcher with INESC-TEC on the topic of special compilers for hardware, and also an Auxiliary Assistant Professor with the Department of Informatics at FEUP.

Interest
Topics
Details

Details

002
Publications

2020

Improving performance and energy consumption in embedded systems via binary acceleration: A survey

Authors
Paulin, N; Ferreira, JC; Cardoso, JMP;

Publication
ACM Computing Surveys

Abstract
The breakdown of Dennard scaling has resulted in a decade-long stall of the maximum operating clock frequencies of processors. To mitigate this issue, computing shifted to multi-core devices. This introduced the need for programming flows and tools that facilitate the expression of workload parallelism at high abstraction levels. However, not all workloads are easily parallelizable, and the minor improvements to processor cores have not significantly increased single-threaded performance. Simultaneously, Instruction Level Parallelism in applications is considerably underexplored. This article reviews notable approaches that focus on exploiting this potential parallelism via automatic generation of specialized hardware from binary code. Although research on this topic spans over more than 20 years, automatic acceleration of software via translation to hardware has gained new importance with the recent trend toward reconfigurable heterogeneous platforms. We characterize this kind of binary acceleration approach and the accelerator architectures on which it relies. We summarize notable state-of-the-art approaches individually and present a taxonomy and comparison. Performance gains from 2.6× to 5.6× are reported, mostly considering bare-metal embedded applications, along with power consumption reductions between 1.3× and 3.9×. We believe the methodologies and results achievable by automatic hardware generation approaches are promising in the context of emergent reconfigurable devices. © 2020 Association for Computing Machinery.

2020

Optimizing OpenCL Code for Performance on FPGA: k-Means Case Study With Integer Data Sets

Authors
Paulino, N; Ferreira, JC; Cardoso, JMP;

Publication
IEEE Access

Abstract

2020

Executing ARMv8 Loop Traces on Reconfigurable Accelerator via Binary Translation Framework

Authors
Paulino, N; Ferreira, JC; Bispo, J; Cardoso, JMP;

Publication
30th International Conference on Field-Programmable Logic and Applications, FPL 2020, Gothenburg, Sweden, August 31 - September 4, 2020

Abstract

2019

Dynamic Partial Reconfiguration of Customized Single-Row Accelerators

Authors
Paulino, NMC; Ferreira, JC; Cardoso, JMP;

Publication
IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Abstract

2017

Generation of Customized Accelerators for Loop Pipelining of Binary Instruction Traces

Authors
Paulino, NMC; Ferreira, JC; Cardoso, JMP;

Publication
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Abstract
Many embedded applications process large amounts of data using regular computational kernels, amenable to acceleration by specialized hardware coprocessors. To reduce the significant design effort, the dedicated hardware may be automatically generated, usually starting from the application's source or binary code. This paper presents a moduloscheduled loop accelerator capable of executing multiple loops and a supporting toolchain. A generation/scheduling procedure, which fully relies on MicroBlaze instruction traces, produces accelerator instances, customized in terms of functional units and interconnections. The accelerators support integer and single-precision floating-point arithmetic, and exploit instruction-level parallelism, loop pipelining, and memory access parallelism via two read/write ports. A complete implementation of the proposed architecture is evaluated in a Virtex-7 device. Augmenting a MicroBlaze processor with a tailored accelerator achieves a geometric mean speedup, over software-only execution, of 6.61x for 13 floating-point kernels from the Livermore Loops set, and of 4.08x for 11 integer kernels from Texas Instruments' IMGLIB. The proposed customized accelerators are compared with ALU-based ones. The average specialized accelerator requires only 0.47x the number of field-programmable gate array slices of an accelerator with four ALUs. A geometric mean speedup of 1.78x over a four-issue very long instruction word (without floating-point support) was obtained for the integer kernels.