The advantages and flexibility that partial dynamic reconfiguration brings to hardware implementation have rapidly changed the design flow of embedded systems. Although it is now common to deal with systems whose dynamic architecture can manage and adapt itself to extremely different working scenarios, it is not so easy to bring the same flexibility to the software part of these systems. To cope with this problem, we developed an innovative modular Linux driver that greatly simplifies the software handling of reconfiguration, allowing the programmer to concentrate on a hierarchical view of the system to be implemented. This methodology can be applied to different architectures, providing a powerful and flexible software solution, and at the same time it can be easily customized to respond to specific behaviors and requirements.
In high-performance systems, stencil computations play a crucial role, as they appear in a wide variety of application fields, ranging from the solution of partial differential equations, to computer simulation of particle interactions, to image processing and computer vision. The computationally intensive nature of these algorithms has created the need for efficient implementations that save both execution time and energy. This, in combination with their regular structure, has justified their widespread study and the proposal of widely different approaches to their optimization. However, most of these works focus on aggressive compile-time optimization, cache locality optimization, and parallelism extraction for the multicore/multiprocessor domain, while fewer works use custom architectures to further exploit the regular structure of Iterative Stencil Loops (ISLs), specifically with the goal of improving power efficiency. This work introduces a methodology to systematically design power-efficient hardware accelerators for the optimal execution of ISL algorithms on Field-Programmable Gate Arrays (FPGAs). As part of the methodology, we introduce the notion of Streaming Stencil Time-step (SST), a streaming-based architecture capable of achieving both low resource usage and efficient data reuse thanks to an optimal data buffering strategy, and we introduce a technique called SSTs queuing that is capable of delivering a pseudolinear execution time speedup with constant bandwidth. The methodology has been validated on significant benchmarks on a Virtex-7 FPGA using the Xilinx Vivado suite. Results demonstrate how the efficient usage of on-chip memory resources realized by an SST makes it possible to handle problem sizes that could not be implemented by directly synthesizing the original, unmanipulated code with High-Level Synthesis (HLS).
We also show how SSTs queuing effectively ensures a pseudolinear throughput speedup while consuming constant off-chip bandwidth.
CCS Concepts: • Hardware → Hardware-software codesign; Methodologies for EDA; Sequential circuits; • Software and its engineering → Data flow architectures; • Theory of computation → Streaming models; Massively parallel algorithms;
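To make the streaming idea concrete, the following is a minimal software sketch, not the authors' hardware design. It assumes a hypothetical 1D 3-point averaging stencil with fixed boundary cells. Each `streaming_step` generator stands in for one time step that keeps only a 3-element window of the stream, and `chained_steps` queues several such stages so that multiple time steps are computed in a single pass over the input. All function names are illustrative.

```python
from collections import deque
from typing import Iterable, Iterator, List


def stencil_step_reference(grid: List[float]) -> List[float]:
    """One time step of a 3-point averaging stencil on a full array.

    Boundary cells (first and last) are left unchanged.
    """
    out = list(grid)
    for i in range(1, len(grid) - 1):
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def streaming_step(stream: Iterable[float]) -> Iterator[float]:
    """One time step over a stream, buffering only three values at a time."""
    window = deque(maxlen=3)
    for value in stream:
        window.append(value)
        if len(window) == 1:
            # Left boundary cell passes through unchanged.
            yield value
        elif len(window) == 3:
            # The window is centred on the cell being updated.
            yield (window[0] + window[1] + window[2]) / 3.0
    if len(window) >= 2:
        # Right boundary cell passes through unchanged.
        yield window[-1]


def chained_steps(grid: Iterable[float], steps: int) -> List[float]:
    """Queue `steps` streaming stages; the input is read exactly once."""
    stream: Iterator[float] = iter(grid)
    for _ in range(steps):
        stream = streaming_step(stream)
    return list(stream)


if __name__ == "__main__":
    print(chained_steps([0.0, 3.0, 6.0, 9.0, 0.0, 3.0], steps=3))
```

In this sketch each added stage consumes the output of the previous one, so the number of reads from the original input does not grow with the number of time steps. That is the software analogue of the constant-bandwidth property described above, under the stated simplifications.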
Designing applications for heterogeneous systems, such as Multiprocessor Systems-on-Chip (MPSoCs) based on Field Programmable Gate Arrays (FPGAs), is a complex task. In order to exploit all the capabilities of these systems, such as Partial Dynamic Reconfiguration (PDR) and hardware acceleration, the designer still has to develop large parts of the system unassisted, basing the design choices (i.e., whether to assign a task of the application to a hardware region of the FPGA or to a general-purpose processor of the SoC) mostly on his/her experience. In this paper we present a Mixed-Integer Linear Programming (MILP) formulation for the mapping and scheduling of applications on heterogeneous and reconfigurable devices, taking into account PDR, module reuse and configuration prefetching. Starting from a target architecture and a description of the application in terms of tasks and data dependencies, the proposed formulation allows the designer to optimize a linear combination of different metrics such as execution time, peak power and energy consumption.
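As a rough illustration of what optimizing "a linear combination of different metrics" can look like, the objective might take the form below. The symbols are assumptions introduced here, not the paper's notation: $T_{\mathrm{exec}}$ for execution time, $P_{\mathrm{peak}}$ for peak power, $E_{\mathrm{tot}}$ for energy consumption, and designer-chosen non-negative weights $w_T$, $w_P$, $w_E$.

```latex
\min \; w_T \, T_{\mathrm{exec}} + w_P \, P_{\mathrm{peak}} + w_E \, E_{\mathrm{tot}},
\qquad w_T,\, w_P,\, w_E \ge 0
```

Setting a weight to zero drops the corresponding metric from the trade-off. The mapping and scheduling constraints of the actual formulation are not reproduced here.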
The determination of the optical flow is a central problem in image processing, as it makes it possible to describe how an image changes over time by means of a numerical vector field. The estimation of the optical flow is, however, a very complex problem, which has been tackled using many different mathematical approaches. A large body of work has recently been published on variational methods, following the technique for total variation minimization proposed by Chambolle. Still, their hardware implementations do not offer good performance in terms of frames processed per unit of time, mainly because of the complex dependency scheme among the data. In this work, we propose a highly parallel and accelerated FPGA implementation of the Chambolle algorithm, which splits the original image into a set of overlapping sub-frames and efficiently exploits the reuse of intermediate results. We validate our hardware on large frames (up to 1024 × 768), and the proposed approach significantly improves on state-of-the-art implementations, reaching speedups of up to 76×, which enables real-time frame rates even at high resolutions.
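The splitting of a frame into overlapping sub-frames can be sketched as follows. This is only an illustration of the general tiling idea, not the decomposition used in the work above. The tile size and the overlap width (`halo`) are hypothetical parameters, and `overlapping_tiles` is an illustrative name.

```python
from typing import List, Tuple

# (row_start, row_end, col_start, col_end), end indices exclusive.
Tile = Tuple[int, int, int, int]


def overlapping_tiles(height: int, width: int, tile: int, halo: int) -> List[Tile]:
    """Split a height x width frame into tiles that overlap by `halo` pixels.

    Each tile covers a `tile` x `tile` core region, extended by `halo`
    pixels on every side and clipped to the frame borders.
    """
    tiles: List[Tile] = []
    for r in range(0, height, tile):
        for c in range(0, width, tile):
            tiles.append((
                max(r - halo, 0),
                min(r + tile + halo, height),
                max(c - halo, 0),
                min(c + tile + halo, width),
            ))
    return tiles


if __name__ == "__main__":
    # Hypothetical parameters for a 1024 x 768 frame (768 rows, 1024 columns).
    for t in overlapping_tiles(768, 1024, tile=256, halo=8):
        print(t)
```

The overlap gives each sub-frame the neighbouring pixels it needs, so that sub-frames can be processed independently. How wide the overlap must be depends on the algorithm and on the number of iterations, which this sketch does not model.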