Publications

Publications by João Paiva Cardoso

2017

Source code transformations and optimizations

Authors
Cardoso, JM; Coutinho, JGF; Diniz, PC;

Publication
Embedded Computing for High Performance

Abstract

2017

Source code analysis and instrumentation

Authors
Cardoso, JM; Coutinho, JGF; Diniz, PC;

Publication
Embedded Computing for High Performance

Abstract

2017

Controlling the design and development cycle

Authors
Cardoso, JM; Coutinho, JGF; Diniz, PC;

Publication
Embedded Computing for High Performance

Abstract

2017

High-performance embedded computing

Authors
Cardoso, JM; Coutinho, JGF; Diniz, PC;

Publication
Embedded Computing for High Performance

Abstract

2024

A Flexible-Granularity Task Graph Representation and Its Generation from C Applications (WIP)

Authors
Santos, T; Bispo, J; Cardoso, JMP;

Publication
PROCEEDINGS OF THE 25TH ACM SIGPLAN/SIGBED INTERNATIONAL CONFERENCE ON LANGUAGES, COMPILERS, AND TOOLS FOR EMBEDDED SYSTEMS, LCTES 2024

Abstract
Modern hardware accelerators, such as FPGAs, allow offloading large regions of C/C++ code to improve the execution time and/or the energy consumption of software applications. An outstanding challenge with this approach, however, is solving the Hardware/Software (Hw/Sw) partitioning problem. Given the increasing complexity of both the accelerators and the potential code regions, one needs to adopt a holistic approach when selecting an offloading region, exploring the interplay between communication costs, data usage patterns, and target-specific optimizations. To this end, we propose representing a C application as an extended task graph (ETG) with flexible granularity, which can be manipulated through the merging and splitting of tasks. This approach involves generating a task graph overlay on the program's Abstract Syntax Tree (AST) that maps tasks to functions and the flexible-granularity operations to inlining/outlining operations. This maintains the integrity and readability of the original source code, which is paramount for targeting different accelerators and enabling code optimizations, while allowing the offloading of code regions of arbitrary complexity based on the data patterns of their tasks. To evaluate the ETG representation and its compiler, we use the latter to generate ETGs for the programs in the Rosetta and MachSuite benchmark suites, and we extract several metrics regarding data communication, task-level parallelism, and dataflow patterns between pairs of tasks. These metrics provide important information that can be used by Hw/Sw partitioning methods.
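
The ETG itself is defined by the paper and its compiler; purely as an illustration of the idea, the C++ sketch below (all names, fields, and simplifications are hypothetical, not taken from the paper) models tasks that map to source functions and a merge operation that coarsens granularity, analogous to inlining one task's function into another.

// Hypothetical illustration of a flexible-granularity task graph.
// Names, fields, and the simplistic merge are assumptions for exposition,
// not the ETG implementation described in the paper.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Task {
    std::string function;            // source function this task maps to
    std::vector<Task*> successors;   // dataflow edges to later tasks
    long bytesCommunicated = 0;      // data exchanged with successors
};

struct TaskGraph {
    std::vector<std::unique_ptr<Task>> tasks;

    Task* addTask(const std::string& fn, long bytes) {
        auto t = std::make_unique<Task>();
        t->function = fn;
        t->bytesCommunicated = bytes;
        tasks.push_back(std::move(t));
        return tasks.back().get();
    }

    // Coarsen granularity: merging two tasks corresponds to inlining the
    // callee's function into the caller at the source level. (A real graph
    // would also remove the callee node and rewire all of its edges.)
    void merge(Task* caller, Task* callee) {
        caller->function += "+" + callee->function;
        caller->bytesCommunicated += callee->bytesCommunicated;
        caller->successors = callee->successors;
    }
};

int main() {
    TaskGraph g;
    Task* read  = g.addTask("read_input", 1024);
    Task* work  = g.addTask("compute_kernel", 4096);
    Task* write = g.addTask("write_output", 1024);
    read->successors = {work};
    work->successors = {write};

    g.merge(read, work);  // coarser task: read_input+compute_kernel
    std::cout << g.tasks[0]->function << " communicates "
              << g.tasks[0]->bytesCommunicated << " bytes\n";
    return 0;
}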

2023

A CPU-FPGA Holistic Source-to-Source Compilation Approach for Partitioning and Optimizing C/C++ Applications

Authors
Santos, T; Bispo, J; Cardoso, JMP;

Publication
2023 32ND INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT

Abstract
A common approach for improving performance uses FPGAs to accelerate critical code regions, which often involves two processes: hardware/software partitioning, which identifies regions to offload to the FPGA, and optimizing those regions (e.g., through HLS directives). As both processes are separate and usually applied in sequence, the interplay between them is unnatural, and it is unclear how the choices made in one step can benefit the choices made in the other. This paper presents our work in progress on combining partitioning and optimization into a single holistic process. First, our source-to-source compiler builds a task-based representation from the input application. Then, a greedy algorithm builds clusters of tasks and assigns each cluster to either hardware (FPGA) or software (CPU). The algorithm iteratively refines the clusters and offloading decisions by: a) minimizing the communication costs between clusters by assigning tasks that work with shared data to the same cluster; and b) reducing the global execution time by applying code optimizations to the tasks in each cluster. We show the impact of our holistic approach on a motivating edge detection example and compare the results against applying partitioning and code optimizations as independent steps. The results show that holistic partitioning can lead to a speedup of up to 28.7x compared to simply offloading the application to an FPGA.
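
The clustering rules and cost models belong to the paper; the hypothetical C++ sketch below only illustrates the final offloading decision of such a flow, where each pre-built cluster of tasks stays on the CPU or moves to the FPGA depending on whether the estimated accelerator time plus data-transfer time beats the software-only estimate. All names and numbers are assumptions for exposition, not the paper's algorithm or results.

// Hypothetical Hw/Sw assignment step over pre-built task clusters;
// the cost model and example values are placeholders for illustration.
#include <iostream>
#include <string>
#include <vector>

struct Cluster {
    std::vector<std::string> tasks;
    double cpuTime = 0.0;        // estimated time if kept on the CPU (s)
    double fpgaTime = 0.0;       // estimated time if offloaded (s)
    double transferTime = 0.0;   // estimated CPU<->FPGA data transfer (s)
    bool onFpga = false;
};

// Offload a cluster only when accelerator time plus communication
// beats the software-only estimate.
void assign(std::vector<Cluster>& clusters) {
    for (auto& c : clusters)
        c.onFpga = (c.fpgaTime + c.transferTime) < c.cpuTime;
}

int main() {
    std::vector<Cluster> clusters = {
        {{"gaussian", "sobel"}, 0.90, 0.10, 0.05},  // compute-heavy: worth offloading
        {{"io", "setup"},       0.02, 0.01, 0.20},  // transfer-dominated: keep on CPU
    };
    assign(clusters);
    for (const auto& c : clusters) {
        std::cout << (c.onFpga ? "FPGA:" : "CPU: ");
        for (const auto& t : c.tasks) std::cout << ' ' << t;
        std::cout << '\n';
    }
    return 0;
}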
