Publications

Publications by HumanISE

2015

Programming strategies for contextual runtime specialization

Authors
Carvalho, T; Pinto, P; Cardoso, JMP;

Publication
Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, SCOPES 2015

Abstract
Runtime adaptability is expected to adjust the application and the mapping of computations according to usage contexts, operating environments, resources availability, etc. However, extending applications with adaptive features can be a complex task, especially due to the current lack of programming models and compiler support. One of the runtime adaptability possibilities is the use of specialized code according to data workloads and environments. Traditional approaches use multiple code versions generated offline and, during runtime, a strategy is responsible to select a code version. Moving code generation to runtime can achieve important improvements but may impose unacceptable overhead. This paper presents an aspect-oriented programming approach for runtime adaptability. We focus on a separation of concerns (strategies vs. application) promoted by a domain-specific language for programming runtime strategies. Our strategies allow runtime specialization based on contextual information. We use a template-based runtime code generation approach to achieve program specialization. We demonstrate our approach with examples from image processing, which depict the benefits of runtime specialization and illustrate how several factors need to be considered to eficiently adapt the application. © 2015 ACM.

CloseRead Abstract

2015

Reducing Misses to External Memory Accesses in Task-Level Pipelining

Authors
Azarian, A; Cardoso, JMP;

Publication
2015 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS)

Abstract
Recently, researchers have shown an increased interest in using task-level pipelining to accelerate the overall execution of applications mainly consisting of producer-consumer tasks. This paper proposes optimization techniques for enhancing our approach to pipeline the execution of producer-consumer tasks in FPGA-based multicore architectures with reductions in the number of accesses to external memory. Our approach is able to speedup the overall execution of successive, data-dependent tasks, by using multiple cores and specific customization features provided by FPGAs. We evaluate the impact in the performance of task-level pipelining when using different hash functions and optimization schemes in the inter stage buffer (ISB). The optimizations proposed in this paper were evaluated with FPGA implementations. The experimental results show the efficiency of a simple scheme to reduce external memory accesses and the suitability of the hash function being used. Furthermore, the results reveal noticeable performance improvements for the set of benchmarks being used.

CloseRead Abstract

2015

Techniques for efficient MATLAB-to-C compilation

Authors
Bispo, J; Reis, L; Cardoso, JMP;

Publication
Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY@PLDI, Portland, OR, USA, June 15 - 17, 2015

Abstract
MATLAB to C translation is foreseen to raise the overall abstraction level when mapping computations to embedded systems (possibly consisting of software and hardware components), and thus for increasing productivity and for providing an automated modeldriven design-flow. This paper describes recent work developed in the context of MATISSE, a MATLAB to C compiler targeting embedded systems. We introduce several techniques to allow the efficient generation of C code, such as weak types, primitives and matrix views. We evaluate the compiler with a set of 9 publicly available benchmarks, targeting both embedded systems and a desktop system. We compare the execution time of the generated C code with the original code running on MATLAB, achieving a geometric mean speedup of 8.1 ×, and qualitatively compare our results with the performance of related approaches. The use of the new techniques allowed the compiler to achieve performance improvements of 46% on average.

CloseRead Abstract

2015

C and OpenCL Generation from MATLAB

Authors
Bispo, J; Reis, L; Cardoso, JMP;

Publication
30TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, VOLS I AND II

Abstract
In many engineering and science areas, models are developed and validated using high-level programing languages and environments as is the case with MATLAB. In order to target the multicore heterogeneous architectures being used on embedded systems to provide high performance computing with acceptable energy/power envelops, developers manually migrate critical code sections to lower-level languages such as C and OpenCL, a time consuming and error prone process. Thus, automatic source-to-source approaches are highly desirable. We present an approach to compile MATLAB and output equivalent C/OpenCL code to target architectures, such as GPU based hardware accelerators. We evaluate our approach on an existing MATLAB compiler framework named MATISSE. The OpenCL generation relies on the manual insertion of directives to guide the compilation and is also capable of generating C wrapper code to interface and synchronize with the OpenCL code. We evaluated the compiler with a number of benchmarks from different domains and the results are very encouraging.

CloseRead Abstract

2015

Transparent Acceleration of Program Execution Using Reconfigurable Hardware

Authors
Paulino, N; Ferreira, JC; Bispo, J; Cardoso, JMP;

Publication
2015 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE)

Abstract
The acceleration of applications, running on a general purpose processor (GPP), by mapping parts of their execution to reconfigurable hardware is an approach which does not involve program's source code and still ensures program portability over different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences of GPP instructions need to be translated/mapped to hardware, possibly at runtime. Thus, all mapping steps, from compiler analysis and optimizations to hardware generation, need to be both efficient and fast. This paper introduces some of the most representative approaches for binary acceleration using reconfigurable hardware, and presents our binary acceleration approach and the latest results. Our approach extends a GPP with a Reconfigurable Processing Unit (RPU), both sharing the data memory. Repeating sequences of GPP instructions are migrated to an RPU composed of functional units and interconnect resources, and able to exploit instruction-level parallelism, e.g., via loop pipelining. Although we envision a fully dynamic system, currently the RPU resources are selected and organized offline using execution trace information. We present implementation prototypes of the system on a Spartan-6 FPGA with a MicroBlaze as GPP and the very encouraging results achieved with a number of benchmarks.

CloseRead Abstract

2015

Use of previously acquired positioning of optimizations for phase ordering exploration

Authors
Nobre, R; Martins, LGA; Cardoso, JMP;

Publication
Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, SCOPES 2015

Abstract
This paper presents a new approach to efficiently search for suitable compiler pass sequences, a challenge known as phase ordering. Our approach relies on information about the relative positions of compiler passes in compiler pass sequences previously generated for a set of functions when compiling for a specific processor. We enhanced two iterative compiler pass exploration schemes, one relying on simple sequential compiler pass insertion and other implementing an auto-tuned simulated annealing process, with a data structure that holds information about the relative positions of compiler sequences; in order to reduce the set of compiler passes considered for insertion in a given position of a given candidate compiler pass sequence to include only the passes that have a higher probability of performing well on that relative position in the compiler sequence, speeding up the exploration time as a result. We tested our approach with two different compilers and two different targets; the ReflectC and the LLVM compilers, targeting a MicroBlaze processor and a LEON3 processor, respectively. The experimental results show that we can considerably reduce the number of algorithm iterations by a factor of up to more than an order of magnitude when targeting the MicroBlaze or the LEON3, while finding compiler sequences that result in binaries that when executed on the target processor/simulator are able to outperform (i.e. use less CPU cycles) all the standard optimization levels (i.e., we compare against the most performing optimization level flag on each kernel, e.g. -O1, -O2 or -O3 in the case of LLVM) by a geometric mean performance improvement of 1.23x and 1.20x when targeting the MicroBlaze processor, and 1.94x and 2.65x when targetting the LEON3 processor; for each of the two exploration algorithms and two kernel sets considered. © 2015 ACM.

CloseRead Abstract