Cookies Policy
We use cookies to improve our site and your experience. By continuing to browse our site you accept our cookie policy. Find out More
Close
  • Menu
About

About

I received my Master's Degree from FEUP (Faculdade de Engenharia da Universidade do Porto), in Electrical and Computer Engineering. My thesis was titled Generation of Reconfigurable Circuits from Machine Code, a work which continued throughout my PhD in Electrical and Computer Engineering, also at FEUP, and in association with INESC-TEC.

Having completed my PhD thesis, Generation of Custom Run-time Reconfigurable Hardware for Transparent Binary Acceleration, I am now a post-doc researcher with INESC-TEC on the topic of special compilers for hardware, and also an Auxiliary Assistant Professor with the Department of Informatics at FEUP.

Interest
Topics
Details

Details

001
Publications

2017

Generation of Customized Accelerators for Loop Pipelining of Binary Instruction Traces

Authors
Paulino, NMC; Ferreira, JC; Cardoso, JMP;

Publication
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Abstract
Many embedded applications process large amounts of data using regular computational kernels, amenable to acceleration by specialized hardware coprocessors. To reduce the significant design effort, the dedicated hardware may be automatically generated, usually starting from the application's source or binary code. This paper presents a moduloscheduled loop accelerator capable of executing multiple loops and a supporting toolchain. A generation/scheduling procedure, which fully relies on MicroBlaze instruction traces, produces accelerator instances, customized in terms of functional units and interconnections. The accelerators support integer and single-precision floating-point arithmetic, and exploit instruction-level parallelism, loop pipelining, and memory access parallelism via two read/write ports. A complete implementation of the proposed architecture is evaluated in a Virtex-7 device. Augmenting a MicroBlaze processor with a tailored accelerator achieves a geometric mean speedup, over software-only execution, of 6.61x for 13 floating-point kernels from the Livermore Loops set, and of 4.08x for 11 integer kernels from Texas Instruments' IMGLIB. The proposed customized accelerators are compared with ALU-based ones. The average specialized accelerator requires only 0.47x the number of field-programmable gate array slices of an accelerator with four ALUs. A geometric mean speedup of 1.78x over a four-issue very long instruction word (without floating-point support) was obtained for the integer kernels.

2015

Transparent acceleration of program execution using reconfigurable hardware

Authors
Paulino, N; Ferreira, JC; Bispo, J; Cardoso, JMP;

Publication
Proceedings -Design, Automation and Test in Europe, DATE

Abstract
The acceleration of applications, running on a general purpose processor (GPP), by mapping parts of their execution to reconfigurable hardware is an approach which does not involve program's source code and still ensures program portability over different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences of GPP instructions need to be translated/mapped to hardware, possibly at runtime. Thus, all mapping steps, from compiler analysis and optimizations to hardware generation, need to be both efficient and fast. This paper introduces some of the most representative approaches for binary acceleration using reconfigurable hardware, and presents our binary acceleration approach and the latest results. Our approach extends a GPP with a Reconfigurable Processing Unit (RPU), both sharing the data memory. Repeating sequences of GPP instructions are migrated to an RPU composed of functional units and interconnect resources, and able to exploit instruction-level parallelism, e.g., via loop pipelining. Although we envision a fully dynamic system, currently the RPU resources are selected and organized offline using execution trace information. We present implementation prototypes of the system on a Spartan-6 FPGA with a MicroBlaze as GPP and the very encouraging results achieved with a number of benchmarks. © 2015 EDAA.

2013

Transparent runtime migration of loop-based traces of processor instructions to reconfigurable processing units

Authors
Bispo, J; Paulino, N; Cardoso, JMP; Ferreira, JC;

Publication
International Journal of Reconfigurable Computing

Abstract
The ability to map instructions running in a microprocessor to a reconfigurable processing unit (RPU), acting as a coprocessor, enables the runtime acceleration of applications and ensures code and possibly performance portability. In this work, we focus on the mapping of loop-based instruction traces (called Megablocks) to RPUs. The proposed approach considers offline partitioning and mapping stages without ignoring their future runtime applicability. We present a toolchain that automatically extracts specific trace-based loops, called Megablocks, from MicroBlaze instruction traces and generates an RPU for executing those loops. Our hardware infrastructure is able to move loop execution from the microprocessor to the RPU transparently, at runtime, and without changing the executable binaries. The toolchain and the system are fully operational. Three FPGA implementations of the system, differing in the hardware interfaces used, were tested and evaluated with a set of 15 application kernels. Speedups ranging from 1.26 × to 3.69 × were achieved for the best alternative using a MicroBlaze processor with local memory. © 2013 João Bispo et al.

2013

Transparent Trace-Based Binary Acceleration for Reconfigurable HW/SW Systems

Authors
Bispo, J; Paulino, N; Cardoso, JMP; Ferreira, JC;

Publication
IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS

Abstract
This paper presents a novel approach to accelerate program execution by mapping repetitive traces of executed instructions, called Megablocks, to a runtime reconfigurable array of functional units. An offline tool suite extracts Megablocks from microprocessor instruction traces and generates a Reconfigurable Processing Unit (RPU) tailored for the execution of those Megablocks. The system is able to transparently movebcomputations from the microprocessor to the RPU at runtime. A prototype implementation of the system using a cacheless MicroBlaze microprocessor running code located in external memory reaches speedups from 2.2x to 18.2x for a set of 14 benchmark kernels. For a system setup which maximizes microprocessor performance by having the application code located in internal block RAMs, speedups from 1.4x to 2.8x were estimated.