Untitled

Liang, Xuejun; Jean, Jack; Tomko, Karen

doi:10.1023/a:1011196613858

Cited by 30 publications

(1 citation statement)

References 6 publications


“…In Ritcher et al [2012], for instance, a generic tunable VHDL template has been proposed to parallelize 3D stencil computations. Their work uses the so-called Full Buffering [Liang et al 2001] instead of Partial Buffering (which is a strategy where solely the data needed by the current computation is stored to minimize memory consumption), a technique in which data is stored in the on-chip memory until all the computations depending on it have completed, showing that the increasing number of available resources in modern FPGAs allows one to obtain very good performance. However, the contributions of this work do not include either an explicitly streaming mechanism or a scalable solution (i.e., capable of targeting multiple processing elements with adequate memory and bandwidth considerations), which we do in our work.…”

Section: Custom Architectures (mentioning)

confidence: 99%

On How to Accelerate Iterative Stencil Loops

Cattaneo

Natale

Sicignano

et al. 2015

ACM Trans. Archit. Code Optim.


In high-performance systems, stencil computations play a crucial role as they appear in a variety of different fields of application, ranging from partial differential equation solving, to computer simulation of particles' interaction, to image processing and computer vision. The computationally intensive nature of those algorithms created the need for solutions to efficiently implement them in order to save both execution time and energy. This, in combination with their regular structure, has justified their widespread study and the proposal of largely different approaches to their optimization. However, most of these works are focused on aggressive compile time optimization, cache locality optimization, and parallelism extraction for the multicore/multiprocessor domain, while fewer works are focused on the exploitation of custom architectures to further exploit the regular structure of Iterative Stencil Loops (ISLs), specifically with the goal of improving power efficiency. This work introduces a methodology to systematically design power-efficient hardware accelerators for the optimal execution of ISL algorithms on Field-programmable Gate Arrays (FPGAs). As part of the methodology, we introduce the notion of Streaming Stencil Time-step (SST), a streaming-based architecture capable of achieving both low resource usage and efficient data reuse thanks to an optimal data buffering strategy, and we introduce a technique called SSTs queuing that is capable of delivering a pseudolinear execution time speedup with constant bandwidth. The methodology has been validated on significant benchmarks on a Virtex-7 FPGA using the Xilinx Vivado suite. Results demonstrate how the efficient usage of the on-chip memory resources realized by an SST allows one to treat problem sizes whose implementation would otherwise not be possible via direct synthesis of the original, unmanipulated code via High-Level Synthesis (HLS). 
We also show how the SSTs queuing effectively ensures a pseudolinear throughput speedup while consuming constant off-chip bandwidth.

CCS Concepts: • Hardware → Hardware-software codesign; Methodologies for EDA; Sequential circuits; • Software and its engineering → Data flow architectures; • Theory of computation → Streaming models; Massively parallel algorithms
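The buffering idea mentioned in the citation statement (keep data on chip until every computation that depends on it has completed) can be illustrated with a small software model. The sketch below is an assumption-laden illustration, not the SST architecture or any code from the cited papers: the function name `stream_stencil_5pt`, the 5-point sum stencil, and the pass-through treatment of border cells are all hypothetical choices made for the example. It consumes a row-major stream once and holds only a sliding window of `2 * width + 1` values, which is the span between a cell's north and south neighbours.

```python
from collections import deque


def stream_stencil_5pt(stream, width, height):
    """Software model of a streamed 5-point sum stencil (illustrative only).

    `stream` yields width * height values in row-major order. Border cells
    are passed through unchanged; interior cells become the sum of the cell
    and its north, south, west, and east neighbours.
    """
    # The window spans from a cell's north neighbour to its south neighbour.
    window = deque(maxlen=2 * width + 1)
    out = []
    for idx, value in enumerate(stream):
        window.append(value)
        center = idx - width  # cell whose south neighbour has just arrived
        if center < 0:
            continue
        row, col = divmod(center, width)
        if 0 < row < height - 1 and 0 < col < width - 1:
            # For interior cells the window is full, so fixed offsets apply.
            north, south = window[0], window[-1]
            west, east = window[width - 1], window[width + 1]
            out.append(north + south + west + east + window[width])
        else:
            # Border cell: emit the stored value unchanged.
            out.append(window[-1 - width])
    # The last row never receives south neighbours; flush it as border cells.
    out.extend(list(window)[-width:])
    return out


if __name__ == "__main__":
    grid = list(range(12))  # a 4-wide, 3-high grid in row-major order
    print(stream_stencil_5pt(iter(grid), width=4, height=3))
```

Each input value is read from the stream exactly once and is dropped as soon as no later stencil evaluation can need it, so the storage requirement depends on the row width rather than on the full grid size.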



Realizing real‐time centroid detection of multiple objects with marching pixels algorithms on programmable customizing hardware

Fey

Reichenbach

Komann

et al. 2011

Concurrency and Computation


In this paper, we present a class of emergent algorithms called Marching Pixels and a corresponding programmable parallel chip architecture. Marching Pixels can be used for real-time image processing in smart camera chips. They are based on hardware agents, which virtually crawl in a pixel grid image to find attributes like centroid, rotation, and size of an arbitrary number of objects given in an image. Because of the distributed and local processing scheme of Marching Pixels, reply times in milliseconds can be fulfilled. This means that, within this time, it is determined where pre-known objects are located and how they are oriented relative to the main axes of the image. We present an example Marching Pixels algorithm and corresponding application-specific and programmable parallel architectures. The latter contains a specific instruction set that allows not only the execution of Marching Pixels algorithms but also of arbitrary Cellular Automata algorithms as an embedded parallel processor. The strengths and weaknesses of this architecture concerning the realization as field-programmable gate arrays and application-specific integrated circuits are discussed by means of hardware synthesis results. These results are compared with the solution achievable on real hardware such as the Atom processor.

... the camera in a way that work can be done in the requested time. The problem is that image processing performed serially with classic algorithms is too slow even on fast single processors. For clarification, consider an example. Let the image have classic VGA resolution of 640 × 480 pixels. This means a serial processor has to compute 307,200 pixels within the required reply time of 10 ms. Hence, only about 32.6 ns remain for the computation of each pixel. Carrying out a reasonable number of operations within this time would require a clock frequency of several GHz, which one would like to avoid in embedded systems because of the high energy dissipation at such high frequencies.

Our answer, in order both to meet the strict real-time requirements and to scale with increasing pixel resolutions, is appropriate parallel low-level (that is, hardware-oriented) algorithms, which are based mainly on local operators. These algorithms must not only work in parallel to be fast but also in a distributed way to be robust.

The scalability in our algorithms is achieved by autonomous agents, which are tasked with travelling virtually within a pixel grid, which corresponds to the image, in order to find the centroid coordinates of objects given in this image. After the agents have found them, the corresponding coordinates are passed to a robot control to enable the robot to grasp the objects. Because these agents march around the pixels, they were given the name Marching Pixels (MPs). Their goal is to visit certain pixels and to gather data about the objects to which the pixels belong. The MP swarm shall further compress the gathered data in order to retrieve desired object information like size, ...
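The per-pixel time budget in the VGA example above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names `per_pixel_budget_ns` and `cycles_per_pixel` are hypothetical, and the clock frequencies in the loop are illustrative assumptions, not figures from the paper.

```python
def per_pixel_budget_ns(width, height, reply_time_ms):
    """Time available per pixel, in nanoseconds, for a serial processor."""
    pixels = width * height
    return reply_time_ms * 1e6 / pixels  # 1 ms = 1e6 ns


def cycles_per_pixel(budget_ns, clock_ghz):
    """Clock cycles that fit into the per-pixel budget at a given clock rate."""
    return budget_ns * clock_ghz  # at 1 GHz one cycle lasts 1 ns


if __name__ == "__main__":
    budget = per_pixel_budget_ns(640, 480, 10.0)
    print(f"{640 * 480} pixels, {budget:.2f} ns per pixel")
    # Assumed example clock rates, chosen only to show the trend.
    for ghz in (0.1, 1.0, 3.0):
        print(f"{ghz} GHz: about {cycles_per_pixel(budget, ghz):.0f} cycles per pixel")
```

The budget comes out at roughly 32.55 ns, consistent with the rounded 32.6 ns quoted in the text, and it shows why a serial processor at a modest clock rate has only a handful of cycles per pixel.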


Fundamentals and Related Work

Keinert

Teich

2011

Design of Image Processing Embedded Systems Using Multidimensional Data Flow

