
Publications by HumanISE

2018

Impact of Vectorization Over 16-bit Data-Types on GPUs

Authors
Reis, L; Nobre, R; Cardoso, JMP;

Publication
PARMA-DITAM 2018: 9TH WORKSHOP ON PARALLEL PROGRAMMING AND RUNTIME MANAGEMENT TECHNIQUES FOR MANY-CORE ARCHITECTURES AND 7TH WORKSHOP ON DESIGN TOOLS AND ARCHITECTURES FOR MULTICORE EMBEDDED COMPUTING PLATFORMS

Abstract
Since the introduction of Single Instruction Multiple Thread (SIMT) GPU architectures, vectorization has seldom been recommended. However, for efficient use of 8-bit and 16-bit data types, vector types are necessary even on these GPUs. When only integer types were natively supported in sizes of less than 32 bits, the usefulness of vectors was limited, but the introduction of hardware support for packed half-precision floating-point computations in recent GPU architectures changes this, as floating-point programs can now also benefit from vector types. Given a GPU kernel, using smaller data-types might not be sufficient to achieve the optimal performance for a given device, even on hardware with native support for half-precision, because the compiler targeting the GPU may not be able to automatically vectorize the code. In this paper, we present a number of examples that make use of the OpenCL vector data-types, which we are currently implementing in our tool for automatic vectorization. We present a number of experiments targeting a graphics card with an AMD Vega 10 XT GPU, which has 2x the peak arithmetic throughput using half-precision when compared with single-precision. For comparison, we also target an older GPU architecture without native support for half-precision arithmetic. We found that, on an AMD Vega 10 XT GPU, half-precision vectorization leads to performance improvements over the scalar version using the same precision (geometric mean speedup of 1.50x), which can be attributed to the GPU being able to make use of native support for arithmetic over packed half-precision data. However, we found that most of the performance improvement from vectorization is caused by related transformations, such as thread coarsening or loop unrolling.
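
As a concrete illustration of the kind of code the abstract refers to, the following is a minimal OpenCL C sketch (not taken from the paper) contrasting a scalar half-precision kernel with a manually vectorized half4 variant; the kernel names, the SAXPY-style computation, and the vector width are illustrative assumptions.

    // Illustrative sketch, not from the paper: scalar vs. vectorized
    // half-precision OpenCL kernels for a SAXPY-style computation.
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    // Scalar version: one half-precision element per work-item.
    __kernel void saxpy_half(__global const half *x,
                             __global const half *y,
                             __global half *out,
                             const float a)
    {
        int i = get_global_id(0);
        out[i] = (half)a * x[i] + y[i];
    }

    // Vectorized version: each work-item handles a half4, so the compiler can
    // emit packed half-precision arithmetic on GPUs with native FP16 support;
    // note that each thread also does more work, i.e. the kernel is coarsened.
    __kernel void saxpy_half4(__global const half4 *x,
                              __global const half4 *y,
                              __global half4 *out,
                              const float a)
    {
        int i = get_global_id(0);
        out[i] = (half)a * x[i] + y[i];
    }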

2018

Rapid Prototyping and Verification of Hardware Modules Generated Using HLS

Authors
Caba, J; Cardoso, JMP; Rincón, F; Dondo, J; López, JC;

Publication
Applied Reconfigurable Computing. Architectures, Tools, and Applications - 14th International Symposium, ARC 2018, Santorini, Greece, May 2-4, 2018, Proceedings

Abstract
Most modern design suites include HLS tools that raise the design abstraction level and provide a fast and direct flow to programmable devices, eliminating manual coding at the RTL. While HLS greatly reduces the design productivity gap, non-negligible problems arise. For instance, the co-simulation strategy may not provide trustworthy results due to the variable accuracy of simulation, especially when considering dynamic reconfiguration and access to system busses. This work proposes mechanisms aimed at improving the verification accuracy using a real device and a testing framework. One of the mechanisms is the inclusion of physical configuration macros (e.g., a clock rate configuration macro) and test assertions based on physical parameters in the verification environment (e.g., timing assertions). In addition, it is possible to change some of those parameters, such as the clock rate, and check the behavior of a hardware component in an overclocking or underclocking scenario. Our on-board testing flow allows faster FPGA iterations to ensure the design intent and the hardware-design behavior match. This flow uses a real device to carry out the verification process and synthesizes only the DUT, generating its partial bitstream in a few minutes. © Springer International Publishing AG, part of Springer Nature 2018.
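
To make the verification mechanisms more concrete, the following C sketch (not the framework's actual API; every function and name is a hypothetical stand-in) shows the shape of a host-side test that reconfigures the DUT clock and applies a timing assertion under underclocking and overclocking; the on-board calls are stubbed so the sketch is self-contained.

    /* Illustrative C sketch; all functions are hypothetical stand-ins for the
     * on-board testing framework described in the abstract. */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stub: in the real flow this would reconfigure the FPGA clock
     * (the "clock rate configuration macro"). */
    static uint32_t current_clock_mhz = 100;
    static void dut_set_clock_mhz(uint32_t mhz) { current_clock_mhz = mhz; }

    /* Stub: in the real flow this would run the HLS-generated DUT on the
     * device and return its measured latency in clock cycles. */
    static uint64_t dut_run_and_measure_cycles(void) { return 4096; }

    int main(void)
    {
        const uint32_t clocks_mhz[] = { 50, 100, 200 };  /* under-, nominal, overclocked */
        for (int c = 0; c < 3; ++c) {
            dut_set_clock_mhz(clocks_mhz[c]);
            uint64_t cycles = dut_run_and_measure_cycles();
            double latency_us = (double)cycles / current_clock_mhz;
            /* Timing assertion based on a physical parameter: latency budget. */
            assert(latency_us <= 100.0);
            printf("%u MHz: %llu cycles (%.2f us)\n",
                   (unsigned)clocks_mhz[c], (unsigned long long)cycles, latency_us);
        }
        return 0;
    }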

2018

SOCRATES - A seamless online compiler and system runtime autotuning framework for energy-aware applications

Authors
Gadioli, D; Nobre, R; Pinto, P; Vitali, E; Ashouri, AH; Palermo, G; Cardoso, JMP; Silvano, C;

Publication
2018 Design, Automation & Test in Europe Conference & Exhibition, DATE 2018, Dresden, Germany, March 19-23, 2018

Abstract
Configuring program parallelism and selecting optimal compiler options according to the underlying platform architecture is a difficult task. Typically, this task is either assigned to the programmer or handled by a standard one-size-fits-all policy generated by the compiler or runtime system. A runtime selection of the best configuration requires inserting a substantial amount of glue code for profiling and runtime selection. This represents a programming wall for application developers. This paper presents a structured approach, called SOCRATES, based on an aspect-oriented language (LARA) and a runtime autotuner (mARGOt) to mitigate this problem. LARA has been used to hide the glue code insertion, thus separating the purely functional application description from extra-functional requirements. mARGOt has been used for the automatic selection of the best configuration according to the runtime evolution of the application. © 2018 EDAA.
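
As a rough illustration of the "glue code" the paper aims to hide, the following C sketch (not SOCRATES, LARA, or mARGOt code; all names and the dummy workload are hypothetical) mixes profiling and runtime configuration selection directly into the application, which is exactly the coupling the aspect-oriented approach is meant to avoid.

    /* Illustrative C sketch of hand-written profiling/selection glue code;
     * names and workload are hypothetical, not the SOCRATES/mARGOt API. */
    #include <stdio.h>
    #include <time.h>

    #define N_CONFIGS 3
    static const int thread_counts[N_CONFIGS] = { 2, 4, 8 };  /* candidate configurations */

    /* Stand-in for the purely functional kernel (kept tuning-free in SOCRATES). */
    static void compute(int n_threads)
    {
        volatile double acc = 0.0;
        for (long i = 0; i < 20000000L / n_threads; ++i) acc += (double)i * 0.5;
    }

    int main(void)
    {
        int best_cfg = 0;
        double best_time = 1e30;
        /* Profiling glue: time each candidate configuration once. */
        for (int cfg = 0; cfg < N_CONFIGS; ++cfg) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            compute(thread_counts[cfg]);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            if (s < best_time) { best_time = s; best_cfg = cfg; }
        }
        /* Runtime selection glue: keep using the best configuration found. */
        printf("selected %d threads (%.3f s)\n", thread_counts[best_cfg], best_time);
        compute(thread_counts[best_cfg]);
        return 0;
    }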

2018

Autotuning and Adaptivity in Energy Efficient HPC Systems: The ANTAREX Toolbox

Authors
Silvano, C; Palermo, G; Agosta, G; Ashouri, AH; Gadioli, D; Cherubin, S; Vitali, E; Benini, L; Bartolini, A; Cesarini, D; Cardoso, J; Bispo, J; Pinto, P; Nobre, R; Rohou, E; Besnard, L; Lasri, I; Sanna, N; Cavazzoni, C; Cmar, R; Martinovic, J; Slaninova, K; Golasowski, M; Beccari, AR; Manelfi, C;

Publication
2018 ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS

Abstract
Designing and optimizing applications for energy-efficient High Performance Computing systems up to the Exascale era is an extremely challenging problem. This paper presents the toolbox developed in the ANTAREX European project for autotuning and adaptivity in energy-efficient HPC systems. In particular, the modules of the ANTAREX toolbox are described, as well as some preliminary results of its application to two target use cases.

2018

ANTAREX: A DSL-Based Approach to Adaptively Optimizing and Enforcing Extra-Functional Properties in High Performance Computing

Authors
Silvano, C; Agosta, G; Bartolini, A; Beccari, AR; Benini, L; Besnard, L; Bispo, J; Cmar, R; Cardoso, JMP; Cavazzoni, C; Cherubin, S; Gadioli, D; Golasowski, M; Lasri, I; Martinovic, J; Palermo, G; Pinto, P; Rohou, E; Sanna, N; Slaninová, K; Vitali, E;

Publication
21st Euromicro Conference on Digital System Design, DSD 2018, Prague, Czech Republic, August 29-31, 2018

Abstract
The ANTAREX project relies on a Domain Specific Language (DSL) based on Aspect Oriented Programming (AOP) concepts to allow applications to enforce extra-functional properties, such as energy efficiency and performance, and to optimize Quality of Service (QoS) in an adaptive way. The DSL approach allows the definition of energy-efficiency, performance, and adaptivity strategies, as well as their enforcement at runtime through application autotuning and resource and power management. In this paper, we present an overview of the ANTAREX DSL and some of its capabilities through a number of examples, including how the DSL is applied in the context of one of the project use cases. © 2018 IEEE.
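
To give a feel for what runtime enforcement of an extra-functional property can look like, the following C sketch (not ANTAREX or LARA code; in the project, LARA strategies would weave equivalent logic in without touching the functional source, and all names and thresholds here are hypothetical) adapts an application quality knob to meet a per-frame time goal.

    /* Illustrative C sketch of a runtime adaptivity strategy: a quality knob is
     * lowered when a timing goal is missed and raised when there is slack.
     * All names, thresholds, and the workload are hypothetical. */
    #include <stdio.h>
    #include <time.h>

    static void process_frame(int quality)
    {
        /* Stand-in workload: higher quality means more work per frame. */
        volatile double acc = 0.0;
        for (long i = 0; i < 1000000L * quality; ++i) acc += (double)i;
    }

    int main(void)
    {
        const double goal_s = 0.05;   /* extra-functional goal: 50 ms per frame */
        int quality = 8;              /* application knob: 1 (fast) .. 10 (accurate) */
        for (int frame = 0; frame < 20; ++frame) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            process_frame(quality);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            /* Adaptivity policy: trade accuracy for performance to meet the goal. */
            if (s > goal_s && quality > 1)              quality--;
            else if (s < 0.8 * goal_s && quality < 10)  quality++;
            printf("frame %2d: %.3f s, quality %d\n", frame, s, quality);
        }
        return 0;
    }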

2018

Proceedings of the 2nd Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems, ANDARE@PACT 2018, Limassol, Cyprus, November 4, 2018

Authors
Bartolini, A; Cardoso, JMP; Silvano, C;

Publication
ANDARE@PACT

