Cookies
O website necessita de alguns cookies e outros recursos semelhantes para funcionar. Caso o permita, o INESC TEC irá utilizar cookies para recolher dados sobre as suas visitas, contribuindo, assim, para estatísticas agregadas que permitem melhorar o nosso serviço. Ver mais
Aceitar Rejeitar
  • Menu
Publicações

Publicações por CTM

2020

A Dynamically Reconfigurable Dual-Waveform Baseband Modulator for Flexible Wireless Communications

Autores
Ferreira, ML; Ferreira, JC;

Publicação
JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY

Abstract
In future wireless communication systems, several radio access technologies will coexist and interwork to provide a great variety of services with different requirements. Thus, the design of flexible and reconfigurable hardware is a relevant topic in wireless communications. The combination of high performance, programmability and flexibility makes Field-programmable gate array a convenient platform to design such systems, especially for base stations. This paper describes a dynamically reconfigurable baseband modulator for Orthogonal Frequency Division Multiplexing and Filter-bank Multicarrier modulation waveforms implemented on a Virtex-7 board. The design features Dynamic Partial Reconfiguration (DPR) capabilities to adapt its mode of operation at run-time and is compared with a functionally equivalent static multi-mode design regarding processing throughput, resource utilization, functional density and power consumption. The DPR-based design implementation reserves about half the resources used by static multi-mode counterpart. Consequently, the baseband processing dynamic power consumption observed in the DPR-based design is between 26 mW to 90 mW lower than in the static multi-mode design, representing a dynamic power reduction between 13% to 52%. The worst-case DPR latency measured was 1.051 ms, while the DPR energy overhead is below 1.5 mJ. Considering latency requirements for modern wireless standards and power consumption constraints for commercial base stations, the DPR application is shown to be valuable in multi-standard and multi-mode systems, as well as in scenarios such as multiple-input and multiple-output or dynamic spectrum aggregation.

2020

Parallel Implementation of K-Means Algorithm on FPGA

Autores
Dias, LA; Ferreira, JC; Fernandes, MAC;

Publicação
IEEE ACCESS

Abstract
The K-means algorithm is widely used to find correlations between data in different application domains. However, given the massive amount of data stored, known as Big Data, the need for high-speed processing to analyze data has become even more critical, especially for real-time applications. A solution that has been adopted to increase the processing speed is the use of parallel implementations on FPGA, which has proved to be more efficient than sequential systems. Hence, this paper proposes a fully parallel implementation of the K-means algorithm on FPGA to optimize the system & x2019;s processing time, thus enabling real-time applications. This proposal, unlike most implementations proposed in the literature, even parallel ones, do not have sequential steps, a limiting factor of processing speed. Results related to processing time (or throughput) and FPGA area occupancy (or hardware resources) were analyzed for different parameters, reaching performances higher than 53 millions of data points processed per second. Comparisons to the state of the art are also presented, showing speedups of more than over a partially serial implementation.

2020

Improving Performance and Energy Consumption in Embedded Systems via Binary Acceleration: A Survey

Autores
Paulin, N; Ferreira, JC; Cardoso, JMP;

Publicação
ACM COMPUTING SURVEYS

Abstract
The breakdown of Dennard scaling has resulted in a decade-long stall of the maximum operating clock frequencies of processors. To mitigate this issue, computing shifted to multi-core devices. This introduced the need for programming flows and tools that facilitate the expression of workload parallelism at high abstraction levels. However, not all workloads are easily parallelizable, and the minor improvements to processor cores have not significantly increased single-threaded performance. Simultaneously, Instruction Level Parallelism in applications is considerably underexplored. This article reviews notable approaches that focus on exploiting this potential parallelism via automatic generation of specialized hardware from binary code. Although research on this topic spans over more than 20 years, automatic acceleration of software via translation to hardware has gained new importance with the recent trend toward reconfigurable heterogeneous platforms. We characterize this kind of binary acceleration approach and the accelerator architectures on which it relies. We summarize notable state-of-the-art approaches individually and present a taxonomy and comparison. Performance gains from 2.6x to 5.6x are reported, mostly considering bare-metal embedded applications, along with power consumption reductions between 1.3x and 3.9x. We believe the methodologies and results achievable by automatic hardware generation approaches are promising in the context of emergent reconfigurable devices.

2020

A Multifunctional Integrated Circuit Router for Body Area Network Wearable Systems

Autores
Miyandoab, FD; Ferreira, JC; Tavares, VMG; da Silva, JM; Velez, FJ;

Publicação
IEEE-ACM TRANSACTIONS ON NETWORKING

Abstract
A multifunctional router IC to be included in the nodes of a wearable body sensor network is described and evaluated. The router targets different application scenarios, especially those including tens of sensors, embedded into textile materials and with high data-rate communication demands. The router IC supports two different functionality sets, one for sensor nodes and another for the base node, both based on the same circuit module. The nodes are connected to each other by means of woven thick conductive yarns forming a mesh topology with the base node at the center. From the standpoint of the network, each sensor node is a four port router capable of handling packets from destination nodes to the base node, with sufficient redundant paths. The adopted hybrid circuit and packet switching scheme significantly improve network performance in terms of end-to-end delay, throughput and power consumption. The IC also implements a highly precise, sub-microsecond one-way time synchronization protocol which is used for time stamping the acquired data. The communication module was implemented in a 4-metal, 0.35 mu m CMOS technology. The maximum data rate of the system is 35 Mbps while supporting up to 250 sensors, which exceeds current BAN applications scenarios.

2020

Optimizing OpenCL Code for Performance on FPGA: k-Means Case Study With Integer Data Sets

Autores
Paulino, N; Ferreira, JC; Cardoso, JMP;

Publicação
IEEE ACCESS

Abstract
High Level Synthesis (HLS) tools targeting Field Programmable Gate Arrays (FPGAs) aim to provide a method for programming these devices via high-level abstractions. Initially, HLS support for FPGAs focused on compiling C/C CC to hardware circuits. This raised the issue of determining the programming practices which resulted in the best performing circuits. Recently, to further increase the applicability of HLS approaches, renewed effort was placed on support for HLS of OpenCL code for FPGA, raising the same issues of coding practices and performance portability. This paper explores the performance of OpenCL code compiled for FPGAs for different coding techniques. We evaluate the use of task-kernels versus NDRange kernels, data vectorization, the use of on-chip local memories, and data transfer optimizations by exploiting burst access inference. We present this exploration via a case study of the k-means algorithm, and produce a total of 10 OpenCL implementations of the kernel. To determine the effects of different data set characteristics, and to determine the gains from specialization based on number of attributes, we generated a total of 12 integer data sets. The data sets vary regarding the number of instances, number of attributes (i.e., features), and number of clusters. We also vary the number of processing cores, and present the resulting required resources and operating frequencies. Finally, we execute the same OpenCL code on a 4 GHz Intel i7-6700K CPU, showing that the FPGA achieves speedups up to 1.54 x for four cases, and energy savings up to 80% in all cases.

2020

Executing ARMv8 Loop Traces on Reconfigurable Accelerator via Binary Translation Framework

Autores
Paulino, N; Ferreira, JC; Bispo, J; Cardoso, JMP;

Publicação
2020 30TH INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL)

Abstract
Performance and power efficiency in edge and embedded systems can benefit from specialized hardware. To avoid the effort of manual hardware design, we explore the generation of accelerator circuits from binary instruction traces for several Instruction Set Architectures.

  • 123
  • 375