Detalhes
Nome
Nuno Miguel PaulinoCluster
Redes de Sistemas InteligentesCargo
Investigador AuxiliarDesde
01 julho 2012
Nacionalidade
PortugalCentro
Centro de Telecomunicações e MultimédiaContactos
+351222094000
nuno.m.paulino@inesctec.pt
2019
Autores
Paulino, NMC; Ferreira, JC; Cardoso, JMP;
Publicação
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Abstract
2017
Autores
Paulino, NMC; Ferreira, JC; Cardoso, JMP;
Publicação
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
Abstract
Many embedded applications process large amounts of data using regular computational kernels, amenable to acceleration by specialized hardware coprocessors. To reduce the significant design effort, the dedicated hardware may be automatically generated, usually starting from the application's source or binary code. This paper presents a moduloscheduled loop accelerator capable of executing multiple loops and a supporting toolchain. A generation/scheduling procedure, which fully relies on MicroBlaze instruction traces, produces accelerator instances, customized in terms of functional units and interconnections. The accelerators support integer and single-precision floating-point arithmetic, and exploit instruction-level parallelism, loop pipelining, and memory access parallelism via two read/write ports. A complete implementation of the proposed architecture is evaluated in a Virtex-7 device. Augmenting a MicroBlaze processor with a tailored accelerator achieves a geometric mean speedup, over software-only execution, of 6.61x for 13 floating-point kernels from the Livermore Loops set, and of 4.08x for 11 integer kernels from Texas Instruments' IMGLIB. The proposed customized accelerators are compared with ALU-based ones. The average specialized accelerator requires only 0.47x the number of field-programmable gate array slices of an accelerator with four ALUs. A geometric mean speedup of 1.78x over a four-issue very long instruction word (without floating-point support) was obtained for the integer kernels.
2017
Autores
Paulino, N; Reis, L; Cardoso, JMP;
Publicação
Parallel Computing is Everywhere, Proceedings of the International Conference on Parallel Computing, ParCo 2017, 12-15 September 2017, Bologna, Italy
Abstract
2015
Autores
Paulino, N; Ferreira, JC; Bispo, J; Cardoso, JMP;
Publicação
Proceedings -Design, Automation and Test in Europe, DATE
Abstract
The acceleration of applications, running on a general purpose processor (GPP), by mapping parts of their execution to reconfigurable hardware is an approach which does not involve program's source code and still ensures program portability over different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences of GPP instructions need to be translated/mapped to hardware, possibly at runtime. Thus, all mapping steps, from compiler analysis and optimizations to hardware generation, need to be both efficient and fast. This paper introduces some of the most representative approaches for binary acceleration using reconfigurable hardware, and presents our binary acceleration approach and the latest results. Our approach extends a GPP with a Reconfigurable Processing Unit (RPU), both sharing the data memory. Repeating sequences of GPP instructions are migrated to an RPU composed of functional units and interconnect resources, and able to exploit instruction-level parallelism, e.g., via loop pipelining. Although we envision a fully dynamic system, currently the RPU resources are selected and organized offline using execution trace information. We present implementation prototypes of the system on a Spartan-6 FPGA with a MicroBlaze as GPP and the very encouraging results achieved with a number of benchmarks. © 2015 EDAA.
2014
Autores
Paulino, N; Ferreira, JC; Cardoso, JMP;
Publicação
2014 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS (ISPA)
Abstract
This paper presents a binary acceleration approach based on extending a General Purpose Processor (GPP) with a Reconfigurable Processing Unit (RPU), both sharing an external data memory. In this approach repeating sequences of GPP instructions are migrated to the RPU. The RPU resources are selected and organized off-line using execution trace information. The RPU core is composed of Functional Units (FUs) that correspond to single CPU instructions. The FUs are arranged in stages of mutually independent operations. The RPU can enable several stages in tandem, depending on the data dependencies. External data memory accesses are handled by a configurable dual-port cache. A prototype implementation of the architecture on a Spartan-6 FPGA was validated with 12 benchmarks and achieved an overall geometric mean speedup of 1.91x.
The access to the final selection minute is only available to applicants.
Please check the confirmation e-mail of your application to obtain the access code.