2020
Autores
Reis, L; Bispo, J; Cardoso, JMP;
Publicação
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE
Abstract
In order to take advantage of the processing power of current computing platforms, programmers typically need to develop software versions for different target devices. This task is time-consuming and requires significant programming and computer architecture expertise. A possible and more convenient alternative is to start with a single high-level description of a program with minimum implementation details, and generate custom implementations according to the target platform. In this paper, we use MATLAB as a high-level programming language and propose a compiler that targets CPU/GPU computing platforms by generating customized implementations in C and OpenCL. We propose a number of compiler techniques to automatically generate efficient C and OpenCL code from MATLAB programs. One of such compiler techniques relies on heuristics to decide when and how to use Shared Virtual Memory (SVM). The experimental results show that our approach is able to generate code that provides significant speedups (eg, geometric mean speedup of 11x for a set of simple benchmarks) using a discrete GPU over equivalent sequential C code executing on a CPU. With more complex benchmarks, for which only some code regions can be parallelized, and are thus offloaded, the generated code achieved speedups of up to 2.2x. We also show the impact of using SVM, specifically fine-grained buffers, and the results show that the compiler is able to achieve significant speedups, both over the versions without SVM and with naive aggressive SVM use, across three CPU/GPU platforms.
2021
Autores
Santos, T; Paulino, N; Bispo, J; Cardoso, JMP; Ferreira, JC;
Publicação
2021 INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE TECHNOLOGY (ICFPT)
Abstract
By using Dynamic Binary Translation, instruction traces from pre-compiled applications can be offloaded, at runtime, to FPGA-based accelerators, such as Coarse-Grained Loop Accelerators, in a transparent way. However, scheduling onto coarse-grain accelerators is challenging, with two of current known issues being the density of computations that can be mapped, and the effects of memory accesses on performance. Using an in-house framework for analysis of instruction traces, we explore the effect of different window sizes when applying list scheduling, to map the window operations to a coarse-grain loop accelerator model that has been previously experimentally validated. For all window sizes, we vary the number of ALUs and memory ports available in the model, and comment how these parameters affect the resulting latency. For a set of benchmarks taken from the PolyBench suite, compiled for the 32-bit MicroBlaze softcore, we have achieved an average iteration speedup of 5.10x for a basic block repeated 5 times and scheduled with 8 ALUs and memory ports, and an average speedup of 5.46x when not considering resource constraints. We also identify which benchmarks contribute to the difference between these two speedups, and breakdown their limiting factors. Finally, we reflect on the impact memory dependencies have on scheduling.
2021
Autores
Cardoso, JMP; DeHon, A; Pozzi, L;
Publicação
IEEE TRANSACTIONS ON COMPUTERS
Abstract
The papers in this special section focus on compiler optimization for FPGA-based systems. Reconfigurable computing (RC) is growing in importance in many computing domains and systems, from embedded, mobile to cloud, and high-performance computing. We have witnessed important advancements regarding the programming of RC-based systems, but further improvements are needed, especially regarding efficient techniques for automatic mapping of computations described in high-level languages to the RC resources. The resources of high-end FPGAs allow these devices to implement complex Systemson-a-Chip (SoCs) and substantial computational components of software applications, e.g., when used as hardware accelerators and/or as more energy-efficient computing platforms. This, however, increases the continuous need for efficient compilers targeting FPGAs, and other RC platforms, from high-level programming languages.
2025
Autores
Santos, T; Bispo, J; Cardoso, JMP; Hoe, JC;
Publicação
2025 IEEE 33RD ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, FCCM
Abstract
Heterogeneous CPU-FPGA C/C++ applications may rely on High-level Synthesis (HLS) tools to generate hardware for critical code regions. As typical HLS tools have several restrictions in terms of supported language features, to increase the size and variety of offloaded regions, we propose several code transformations to improve synthesizability. Such code transformations include: struct and array flattening; moving dynamic memory allocations out of a region; transforming dynamic memory allocations into static; and asynchronously executing host functions, e.g., printf(). We evaluate the impact of these transformations on code region size using three realworld applications whose critical regions are limited by nonsynthesizable C/C++ language features.
2025
Autores
Santos, T; Bispo, J; Cardoso, JMP;
Publicação
2025 IEEE 33RD ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, FCCM
Abstract
Critical performance regions of software applications are often accelerated by offloading them onto an FPGA. An efficient end result requires the judicious application of two processes: hardware/software (hw/sw) partitioning, which identifies the regions for offloading, and the optimization of those regions for efficient High-level Synthesis (HLS). Both processes are commonly applied separately, not relying on any potential interplay between them, and not revealing how the decisions made in one process could positively influence the other. This paper describes our primary efforts and contributions made so far, and our work-in-progress, in an approach that combines both hw/sw partitioning and optimization into a unified, holistic process, automated using source-to-source compilation. By using an Extended Task Graph (ETG) representation of a C/C++ application, and expanding the synthesizable code regions, our approach aims at creating clusters of tasks for offloading by a) maximizing the potential optimizations applied to the cluster, b) minimizing the global communication cost, and c) grouping tasks that share data in the same cluster.
2025
Autores
Cardoso, JMP; Najjar, WA;
Publicação
Applied Reconfigurable Computing. Architectures, Tools, and Applications - 21st International Symposium, ARC 2025, Seville, Spain, April 9-11, 2025, Proceedings
Abstract
The International Symposium on Applied Reconfigurable Computing (ARC) is an annual forum for the discussion and dissemination of research, notably applying the Reconfigurable Computing (RC) concept to real-world problems. The first edition of ARC took place in 2005, and in 2024, ARC celebrated its 20th edition. During those 20 years, the field of reconfigurable computing saw a tremendous growth in its underlying technology. ARC contributed very significantly to the presentation and dissemination of new ideas, innovative applications, and fruitful discussions, all of which have resulted in the shaping of novel lines of research. Here, we present selected papers from the first 20 years of ARC, that we believe represent the corpus of work and reflect the ARC spirit by covering a broad spectrum of RC applications, benchmarks, tools, and architectures. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
The access to the final selection minute is only available to applicants.
Please check the confirmation e-mail of your application to obtain the access code.