Publications

Publications by HumanISE

2016

Performance-driven instrumentation and mapping strategies using the LARA aspect-oriented programming approach

Authors
Cardoso, JMP; Coutinho, JGF; Carvalho, T; Diniz, PC; Petrov, Z; Luk, W; Goncalves, F;

Publication
SOFTWARE-PRACTICE & EXPERIENCE

Abstract
The development of applications for high-performance embedded systems is a long and error-prone process because in addition to the required functionality, developers must consider various and often conflicting nonfunctional requirements such as performance and/or energy efficiency. The complexity of this process is further exacerbated by the multitude of target architectures and mapping tools. This article describes LARA, an aspect-oriented programming language that allows programmers to convey domain-specific knowledge and nonfunctional requirements to a toolchain composed of source-to-source transformers, compiler optimizers, and mapping/synthesis tools. LARA is sufficiently flexible to target different tools and host languages while also allowing the specification of compilation strategies to enable efficient generation of software code and hardware cores (using hardware description languages) for hybrid target architectures - a unique feature to the best of our knowledge not found in any other aspect-oriented programming language. A key feature of LARA is its ability to deal with different models of join points, actions, and attributes. In this article, we describe the LARA approach and evaluate its impact on code instrumentation and analysis and on selecting critical code sections to be migrated to hardware accelerators for two embedded applications from industry. Copyright (c) 2014 John Wiley & Sons, Ltd.
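
As a rough illustration of the kind of performance instrumentation a LARA strategy can weave into an application, the C fragment below times a selected function with standard clock() probes; the names are hypothetical and this is plain C, not LARA syntax or actual toolchain output. In the LARA approach such probes are described once as an aspect over join points (for example, function calls or loops) and applied automatically, rather than inserted by hand.

#include <stdio.h>
#include <time.h>

/* Hypothetical kernel standing in for a critical code section that an
   instrumentation aspect might select as a join point. */
static double kernel(int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += (double)i * 0.5;
    return acc;
}

int main(void) {
    /* Timing probes of the kind an instrumentation strategy would weave
       around the selected call site. */
    clock_t start = clock();
    double result = kernel(1000000);
    clock_t end = clock();

    printf("kernel result: %f\n", result);
    printf("kernel time: %.6f s\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}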

2016

AutoTuning and Adaptivity appRoach for Energy efficient eXascale HPC systems: the ANTAREX Approach

Authors
Silvano, C; Agosta, G; Bartolini, A; Beccari, AR; Benini, L; Bispo, J; Cmar, R; Cardoso, JMP; Cavazzoni, C; Martinovic, J; Palermo, G; Palkovic, M; Pinto, P; Rohou, E; Sanna, N; Slaninova, K;

Publication
PROCEEDINGS OF THE 2016 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE)

Abstract
The main goal of the ANTAREX project is to express application self-adaptivity through a Domain Specific Language (DSL) and to manage and autotune applications at runtime for green and heterogeneous High Performance Computing (HPC) systems up to the Exascale level. Key innovations of the project include the introduction of a separation of concerns between self-adaptivity strategies and application functionalities. The DSL approach will allow the definition of energy-efficiency, performance, and adaptivity strategies as well as their enforcement at runtime through application autotuning and resource and power management.

2016

SSA-based MATLAB-to-C compilation and optimization

Authors
Reis, L; Bispo, J; Cardoso, JMP;

Publication
Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY@PLDI 2016, Santa Barbara, CA, USA, June 14, 2016

Abstract
Many fields of engineering, science and finance use models that are developed and validated in high-level languages such as MATLAB. However, when moving to environments with resource constraints or portability challenges, these models often have to be rewritten in lower-level languages such as C. Doing so manually is costly and error-prone, but automated approaches tend to generate code that can be substantially less efficient than the handwritten equivalents. Additionally, it is usually difficult to read and improve code generated by these tools. In this paper, we describe how we improved our MATLAB-to-C compiler, based on the MATISSE framework, to be able to compete with handwritten C code. We describe our new IR and the most important optimizations that we use in order to obtain acceptable performance. We also analyze multiple C code versions to identify where the generated code is slower than the handwritten code and identify a few key improvements to generate code capable of outperforming handwritten C. We evaluate the new version of our compiler using a set of benchmarks, including the Disparity benchmark, from the San Diego Vision Benchmark Suite, on a desktop computer and on an embedded device. The achieved results clearly show the efficiency of the current version of the compiler. Copyright is held by the owner/author(s). Publication rights licensed to ACM.
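
For intuition only, the sketch below shows the kind of C a MATLAB-to-C compiler might emit for the elementwise MATLAB expression y = a .* x + b once array operations are lowered to a single loop; the function name and signature are illustrative assumptions, not output of the MATISSE-based compiler.

#include <stddef.h>

/* Illustrative lowering of the MATLAB expression  y = a .* x + b
   over n-element arrays; a hand-written C version and a well-optimized
   generated version can converge to essentially this loop. */
void elementwise_axpb(const double *a, const double *x, const double *b,
                      double *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a[i] * x[i] + b[i];
}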

2016

Towards a Multi-softcore FPGA Approach for the HOG Algorithm

Authors
Mascagni de Holanda, JAM; Paiva Cardoso, JMP; Marques, E;

Publication
2016 IEEE 14TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN)

Abstract
Object detection in images is a computationally demanding task that usually needs to deal with the detection of different classes of objects, and thus requires variations and adaptations easily provided by software solutions. Object detection algorithms are becoming part of real-time smart embedded systems, such as automotive, medical, robotics, and security systems. In most embedded systems, efficient implementations of object detection algorithms need to provide high performance, low power consumption, and programmability to allow greater development flexibility. The Histogram of Oriented Gradients (HOG) is one of the most widely used algorithms for object detection in images. In this paper, we show our work towards mapping the HOG algorithm to an FPGA-based system consisting of multiple Nios II softcore processors, bearing in mind high-performance and programmability issues. We show how to reduce the algorithm's execution time by 19x through source-to-source transformations and, especially, by avoiding redundant processing. Furthermore, we show how pipelined processing using three Nios II processors allows a speedup of 49x compared to the embedded baseline application.
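
As a hedged sketch of one HOG stage that such a mapping has to handle, the C function below accumulates magnitude-weighted gradient orientations of a single cell into a 9-bin histogram; the names, tiling, and stage boundaries are illustrative assumptions, not the paper's Nios II implementation. In a pipelined multi-softcore mapping, stages such as gradient computation, cell binning, and block normalization can be assigned to different processors, each consuming the previous stage's output.

#include <math.h>

#define NBINS  9
#define HOG_PI 3.14159265358979f

/* Illustrative HOG cell-histogram stage. The cell at (cx, cy) must lie at
   least one pixel away from the image border so the central differences
   stay in bounds. */
void cell_histogram(const unsigned char *img, int width,
                    int cx, int cy, int cell, float hist[NBINS]) {
    for (int b = 0; b < NBINS; b++)
        hist[b] = 0.0f;

    for (int y = cy; y < cy + cell; y++) {
        for (int x = cx; x < cx + cell; x++) {
            /* Central-difference gradients. */
            float gx = (float)img[y * width + (x + 1)] - (float)img[y * width + (x - 1)];
            float gy = (float)img[(y + 1) * width + x] - (float)img[(y - 1) * width + x];
            float mag = sqrtf(gx * gx + gy * gy);
            float ang = atan2f(gy, gx);          /* [-pi, pi] */
            if (ang < 0.0f)
                ang += HOG_PI;                   /* unsigned orientation, [0, pi) */
            int bin = (int)(ang / HOG_PI * NBINS);
            if (bin >= NBINS)
                bin = NBINS - 1;
            hist[bin] += mag;                    /* magnitude-weighted vote */
        }
    }
}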

2016

The ANTAREX approach to autotuning and adaptivity for energy efficient HPC systems

Authors
Silvano, C; Agosta, G; Cherubin, S; Gadioli, D; Palermo, G; Bartolini, A; Benini, L; Martinovic, J; Palkovic, M; Slaninová, K; Bispo, J; Cardoso, JMP; Abreu, R; Pinto, P; Cavazzoni, C; Sanna, N; Beccari, AR; Cmar, R; Rohou, E;

Publication
Proceedings of the ACM International Conference on Computing Frontiers, CF'16, Como, Italy, May 16-19, 2016

Abstract
The ANTAREX project aims at expressing application self-adaptivity through a Domain Specific Language (DSL) and at managing and autotuning applications at runtime for green and heterogeneous High Performance Computing (HPC) systems up to Exascale. The DSL approach allows the definition of energy-efficiency, performance, and adaptivity strategies as well as their enforcement at runtime through application autotuning and resource and power management. We show, through a mini-app extracted from one of the project's application use cases, some initial exploration of application precision tuning enabled by the DSL. © 2016 Copyright held by the owner/author(s).

2016

Pipelining data-dependent tasks in FPGA-based multicore architectures

Authors
Azarian, A; Cardoso, JMP;

Publication
MICROPROCESSORS AND MICROSYSTEMS

Abstract
In recent years, there has been increasing interest in using task-level pipelining to accelerate the overall execution of applications mainly consisting of producer/consumer tasks. This paper proposes fine- and coarse-grained data synchronization approaches to achieve pipelined execution of producer/consumer tasks in FPGA-based multicore architectures. Our approaches are able to speed up the overall execution of successive, data-dependent tasks, by using multiple cores and specific customization features provided by FPGAs. An important component of our approach is the use of customized inter-stage buffer schemes to communicate data and to synchronize the cores associated with the producer/consumer tasks. We propose techniques to reduce the number of accesses to external memory in our fine-grained data synchronization approach. The experimental results show the feasibility of the approach in both in-order and out-of-order producer/consumer tasks. Moreover, the results using our approach reveal noticeable performance improvements for a number of benchmarks over a single core implementation without using task-level pipelining.
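
As a minimal software analogue of the inter-stage buffers described above (an illustrative assumption, not the customized FPGA buffer schemes evaluated in the paper), a single-producer/single-consumer ring buffer decouples a producer task from a consumer task and lets both proceed in pipeline fashion:

#include <stdatomic.h>
#include <stdbool.h>

#define FIFO_SIZE 64u  /* must be a power of two */

/* Single-producer/single-consumer FIFO. In the FPGA-based system the
   equivalent role is played by a hardware inter-stage buffer between
   cores; this sketch only illustrates the synchronization pattern. */
typedef struct {
    int data[FIFO_SIZE];
    atomic_uint head;  /* next slot the producer writes */
    atomic_uint tail;  /* next slot the consumer reads  */
} spsc_fifo;

static bool fifo_push(spsc_fifo *f, int value) {
    unsigned head = atomic_load_explicit(&f->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (head - tail == FIFO_SIZE)
        return false;                       /* full: producer must stall */
    f->data[head & (FIFO_SIZE - 1)] = value;
    atomic_store_explicit(&f->head, head + 1, memory_order_release);
    return true;
}

static bool fifo_pop(spsc_fifo *f, int *value) {
    unsigned tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&f->head, memory_order_acquire);
    if (head == tail)
        return false;                       /* empty: consumer must stall */
    *value = f->data[tail & (FIFO_SIZE - 1)];
    atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
    return true;
}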
