2024
Authors
Santos, T; Bispo, J; Cardoso, JMP;
Publication
PROCEEDINGS OF THE 25TH ACM SIGPLAN/SIGBED INTERNATIONAL CONFERENCE ON LANGUAGES, COMPILERS, AND TOOLS FOR EMBEDDED SYSTEMS, LCTES 2024
Abstract
Modern hardware accelerators, such as FPGAs, allow offloading large regions of C/C++ code in order to improve the execution time and/or the energy consumption of software applications. An outstanding challenge with this approach, however, is solving the Hardware/Software (Hw/Sw) partitioning problem. Given the increasing complexity of both the accelerators and the potential code regions, one needs to adopt a holistic approach when selecting an offloading region by exploring the interplay between communication costs, data usage patterns, and target-specific optimizations. To this end, we propose representing a C application as an extended task graph (ETG) with flexible granularity, which can be manipulated through the merging and splitting of tasks. This approach involves generating a task graph overlay on the program's Abstract Syntax Tree (AST) that maps tasks to functions and the flexible granularity operations onto inlining/outlining operations. This maintains the integrity and readability of the original source code, which is paramount for targeting different accelerators and enabling code optimizations, while allowing the offloading of code regions of arbitrary complexity based on the data patterns of their tasks. To evaluate the ETG representation and its compiler, we use the latter to generate ETGs for the programs in Rosetta and MachSuite benchmark suites, and extract several metrics regarding data communication, task-level parallelism, and dataflow patterns between pairs of tasks. These metrics provide important information that can be used by Hw/Sw partitioning methods.
2024
Authors
Henriques, M; Bispo, J; Paulino, N;
Publication
PROCEEDINGS OF THE RAPIDO 2024 WORKSHOP, HIPEAC 2024
Abstract
Hardware specialization is seen as a promising venue for improving computing efficiency, with reconfigurable devices as excellent deployment platforms for application-specific architectures. One approach to hardware specialization is via the popular RISC-V, where Instruction Set Architecture (ISA) extensions for domains such as Edge Artifical Intelligence (AI) are already appearing. However, to use the custom instructions while maintaining a high (e.g., C/C++) abstraction level, the assembler and compiler must be modified. Alternatively, inline assembly can be manually introduced by a software developer with expert knowledge of the hardware modifications in the RISC-V core. In this paper, we consider a RISC-V core with a vectorization and streaming engine to support the Unlimited Vector Extension (UVE), and propose an approach to automatically transform annotated C loops into UVE compatible code, via automatic insertion of inline assembly. We rely on a source-to-source transformation tool, Clava, to perform sophisticated code analysis and transformations via scripts. We use pragmas to identify code sections amenable for vectorization and/or streaming, and use Clava to automatically insert inline UVE instructions, avoiding extensive modifications of existing compiler projects. We produce UVE binaries which are functionally correct, when compared to handwritten versions with inline assembly, and achieve equal and sometimes improved number of executed instructions, for a set of six benchmarks from the Polybench suite. These initial results are evidence towards that this kind of translation is feasible, and we consider that it is possible in future work to target more complex transformations or other ISA extensions, accelerating the adoption of hardware/software co-design flows for generic application cases.
2024
Authors
Matos, JN; Bispo, J; Sousa, LM;
Publication
PROCEEDINGS OF THE RAPIDO 2024 WORKSHOP, HIPEAC 2024
Abstract
Modern compiled software, written in languages such as C, relies on complex compiler infrastructure. However, developing new transformations and improving existing ones can be challenging for researchers and engineers. Often, transformations must be implemented bymodifying the compiler itself, which may not be feasible, for technical or legal reasons. Source-to-source compilers make it possible to directly analyse and transform the original source, making transformations portable across different compilers, and allowing rapid research and prototyping of code transformations. However, this approach has the drawback of exposing the researcher to the full breadth of the source language, which is often more extensive and complex than the IRs used in traditional compilers. In this work, we propose a solution to tame the complexity of the source language and make source-to-source compilers an ergonomic platform for program analysis and transformation. We define a simpler subset of the C language that can implement the same programs with fewer constructs and implement a set of sourceto-source transformations that automatically normalise the input source code into equivalent programs expressed in the proposed subset. Finally, we implement a function inlining transformation that targets the subset as a case study. We show that for this case study, the assumptions afforded by using a simpler language subset greatly improves the number of cases the transformation can be applied, increasing the average success rate from 37%, before normalisation, to 97%, after normalisation. We also evaluate the performance of several benchmarks after applying a naive inlining algorithm, and obtained a 12% performance improvement in certain applications, after compiling with the flag O2, both in Clang and GCC, suggesting there is room for exploring source-level transformations as a complement to traditional compilers.
2024
Authors
Bispo, J; Xydis, S; Curzel, S; Sousa, LM;
Publication
PARMA-DITAM
Abstract
2024
Authors
da Silva, MC; Sousa, L; Paulino, N; Bispo, J;
Publication
APPLIED RECONFIGURABLE COMPUTING. ARCHITECTURES, TOOLS, AND APPLICATIONS, ARC 2024
Abstract
This work addresses the contemporary challenges in computing, caused by the stagnation of Moore's Law and Dennard scaling. The shift towards heterogeneous architectures necessitates innovative compilation strategies, prompting initiatives like the Multi-Level Intermediate Representation (MLIR) project, where progressive code lowering can be achieved through the use of dialects. Our work focuses on developing an MLIR dialect capable of representing streaming data accesses to memory, and Single Instruction Multiple Data (SIMD) vector operations. We also propose our own Structured Representation Language (SRL), a Design Specific Language (DSL) to serve as a precursor into the MLIR layer and subsequent inter-operation between new and existing dialects. The SRL exposes the streaming and vector computational concepts to a higher-level, and serves as intermediate step to supporting code generation containing our proposed dialect from arbitrary input code, which we leave as future work. This paper presents the syntaxes of the SRL DSL and of the dialect, and illustrates how we aim to employ them to target both General-Purpose Processors (GPPs) with SIMD co-processors and custom hardware options such as Field-Programmable Gate Arrayss (FPGAs) and Coarse-Grained Re-configurable Arrays (CGRAs).
2023
Authors
Santos, T; Bispo, J; Cardoso, JMP;
Publication
2023 32ND INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT
Abstract
A common approach for improving performance uses FPGAs to accelerate critical code regions, which often involves two processes: hardware/software partitioning, which identifies regions to offload to the FPGA; and optimizing those regions (e.g., through HLS directives). As both processes are separate and usually applied in sequence, the interplay between them is unnatural, and it is unclear how the choices made in one step can benefit the choices made in the other step. This paper presents our work-in-progress for combining partitioning and optimization into a single holistic process. First, our source-to-source compiler builds a task-based representation from the input application. Then, a greedy algorithm builds clusters of tasks and assigns each cluster to either hardware (FPGA) or software (CPU). The algorithm iteratively refines the clusters and offloading decisions by: a) minimizing the communication costs between clusters by assigning tasks that work with shared data to the same cluster; b) reducing the global execution time by applying code optimizations to the tasks in each cluster. We show the impact of our holistic approach to a motivating edge detection example and compare the results when applying partitioning and code optimizations as independent steps. The results show that a holistic partitioning can lead to a speedup of up to 28.7x when compared to a simple offloading of the application to an FPGA.
The access to the final selection minute is only available to applicants.
Please check the confirmation e-mail of your application to obtain the access code.