Single-graph multiple flows: Energy efficient design alternative for GPGPUs

Voitsechov, Dani; Etsion, Yoav

doi:10.1109/isca.2014.6853234

Cited by 21 publications

(22 citation statements)

References 38 publications

“…Such accesses can be attributed to two major causes: using the memory (global or local) for interthread communication, and having multiple threads access the same memory locations. In this paper we introduce direct inter-thread communication to the previously proposed multithreaded coarse-grain reconfigurable array (MT-CGRA) [6,7]. The proposed dMT-CGRA architecture eliminates redundant memory accesses by allowing threads to directly communicate through the CGRA fabric.…”

Section: Discussion (mentioning)

confidence: 99%

“…The MT-CGRA execution model combines the static and dynamic dataflow models to execute single-instruction multiple-threads (SIMT) programs with better performance and power characteristics than von Neumann GPGPUs [6]. The model converts SIMT kernels into dataflow graphs and maps them to the CGRA fabric, where each functional unit multiplexes its operation on tokens from different instances of a dataflow graph (i.e., threads).…”

Section: Execution and Programming Model (mentioning)

confidence: 99%

“…Nevertheless, neither of these architecture support simultaneous dynamic dataflow execution of threads on the same grid. While, SGMF [6] and VGIW [7] do support simultaneous dynamic multithreaded execution on the same grid, they do not support inter thread communication between threads.…”

Section: Related Work, Dataflow Architectures and CGRAs (mentioning)

confidence: 99%

“…Seeking an alternative to the von Neumann GPGPU model, both the research community and industry began exploring dataflow-based systolic and coarse-grained reconfigurable architectures (CGRA) [1][2][3][4][5]. As part of this push, Voitsechov and Etsion introduced the massively multithreaded CGRA (MT-CGRA) architecture [6,7], which maps the compute graph of CUDA kernels to a CGRA and uses the dynamic dataflow execution model to run multiple CUDA threads. The MT-CGRA architecture leverages the direct connectivity between functional units, provided by the CGRA fabric, to eliminate multiple von Neumann bottlenecks including the register file and instruction control.…”

Section: Introduction (mentioning)

confidence: 99%

Inter-Thread Communication in Multithreaded, Reconfigurable Coarse-Grain Arrays

Voitsechov, Port, Etsion, 2018

2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

Self Cite

Traditional von Neumann GPGPUs only allow threads to communicate through memory on a group-to-group basis. In this model, a group of producer threads writes intermediate values to memory, which are read by a group of consumer threads after a barrier synchronization. To alleviate the memory bandwidth imposed by this method of communication, GPGPUs provide a small scratchpad memory that prevents intermediate values from overloading DRAM bandwidth. In this paper we introduce direct inter-thread communications for massively multithreaded CGRAs, where intermediate values are communicated directly through the compute fabric on a point-to-point basis. This method avoids the need to write values to memory, eliminates the need for a dedicated scratchpad, and avoids workgroup-global barriers. The paper introduces the programming model (CUDA) and execution model extensions, as well as the hardware primitives that facilitate the communication. Our simulations of Rodinia benchmarks running on the new system show that direct inter-thread communication provides an average speedup of 4.5× (13.5× max) and reduces system power by an average of 7× (33× max), when compared to an equivalent Nvidia GPGPU.
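The contrast drawn in the abstract above, group-to-group communication through a scratchpad versus point-to-point forwarding, can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not the dMT-CGRA hardware mechanism or the paper's CUDA extension: threads are simulated sequentially, each thread's intermediate value is arbitrarily chosen as `tid * tid`, each thread consumes its own value and its left neighbour's, and the function names `via_scratchpad` and `via_direct_forwarding` are hypothetical. It only counts how many scratchpad accesses each style needs.

```python
def via_scratchpad(n):
    """Group-to-group style: all threads write, a barrier, then all threads read."""
    scratchpad = [None] * n
    accesses = 0
    # Phase 1: every producer thread writes its intermediate value.
    for tid in range(n):
        scratchpad[tid] = tid * tid  # assumed intermediate value
        accesses += 1
    # Barrier: phase 2 may only start once all of phase 1 has finished.
    out = []
    for tid in range(n):
        own = scratchpad[tid]
        accesses += 1
        left = 0
        if tid > 0:
            left = scratchpad[tid - 1]
            accesses += 1
        out.append(own + left)
    return out, accesses


def via_direct_forwarding(n):
    """Point-to-point style: each thread hands its value straight to its neighbour."""
    out = []
    forwarded = 0  # value in flight from thread tid - 1 (0 for thread 0)
    for tid in range(n):
        own = tid * tid  # same assumed intermediate value
        out.append(own + forwarded)
        forwarded = own  # forwarded to thread tid + 1, no scratchpad involved
    return out, 0


if __name__ == "__main__":
    print(via_scratchpad(4))
    print(via_direct_forwarding(4))
```

Both functions produce the same results; in this toy model the scratchpad variant performs `n` writes and `2n - 1` reads, while the forwarding variant touches no scratchpad at all. The speedup and power figures quoted in the abstract come from the authors' simulations and are not reproduced by this sketch.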

“…The dataflow co-processor in their system did not drive a display directly, but performed pre-processing on 3D data, a function not dissimilar to modern day vertex or geometry shaders. Voitsechov & Etsion present an alternative architecture for GPGPUs based on dataflow computing [15]. In their architecture instructions from CUDA kernels are mapped to a dataflow graph.…”

Section: Related Work (mentioning)

confidence: 99%

Ultra low latency dataflow renderer

Friston, Steed, Tilbury, et al. 2015

2015 25th International Conference on Field Programmable Logic and Applications (FPL)

In highly interactive applications, low latency (the time between a user's action and the response to this action) is critical for a good user experience. Traditional GPU architectures can make very low latencies difficult to achieve. This is because they are designed first and foremost to implement the painter's algorithm, a rendering algorithm that trades off visual realism for moderate computational speed and high scene dynamism. The dataflow programming paradigm, along with dedicated toolchains such as Maxeler's MaxCompiler, enable the design of application-specific graphics accelerators. Such accelerators, however, have the advantage that their architecture can be completely customised. In this paper we present a custom renderer that composites 2D sprites and maps to emulate a graphical user interface. It was designed to facilitate user interaction tests described in our previous work. Our design is ultra low latency, updating what is being driven to the display within 1 ms of receiving user input. This is far lower than traditional GPUs. We describe the operation of our renderer, and our novel DVI display driver output stage. We measure a latency of under 1 ms for our renderer, with an end-to-end delay of 6 ms for our whole apparatus. We compare this with the end-to-end latency of the same apparatus built with a modern GPU, which we measure at 20 ms.

Accelerating Data Transfer in Dataflow Architectures Through a Look-Ahead Acknowledgment Mechanism

Feng et al. 2022

J. Comput. Sci. Technol.
