2014
DOI: 10.1145/2678373.2665701

Fine-grain task aggregation and coordination on GPUs

Abstract: In general-purpose graphics processing unit (GPGPU) computing, data is processed by concurrent threads executing the same function. This model, dubbed single-instruction/multiple-thread (SIMT), requires programmers to coordinate the synchronous execution of similar operations across thousands of data elements. To alleviate this programmer burden, Gaster and Howes outlined the channel abstraction, which facilitates dynamically aggregating asynchronously produced fine-grain work into coarser-grain tasks. Howe…
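The aggregation idea in the abstract is easy to picture in code. Below is a minimal host-side C++ sketch, written only as an illustration of the channel concept; the names (Channel, push, flush) are hypothetical, and a real implementation would place the queue in GPU-visible memory and dispatch each drained batch as a kernel launch rather than calling a std::function under a mutex.

#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

// Illustrative channel: producers asynchronously push fine-grain work items;
// once enough accumulate, the whole batch runs through one function, mirroring
// SIMT execution of the same kernel over many data elements.
template <typename WorkItem>
class Channel {
public:
    Channel(std::function<void(const std::vector<WorkItem>&)> kernel,
            std::size_t batch_size = 256)
        : kernel_(std::move(kernel)), batch_size_(batch_size) {}

    // Any producer thread may push at any time (asynchronous, fine-grain).
    void push(WorkItem w) {
        std::lock_guard<std::mutex> g(m_);
        pending_.push_back(std::move(w));
        if (pending_.size() >= batch_size_) drain_locked();  // coarse-grain task
    }

    // Flush leftovers, e.g. at a phase boundary.
    void flush() {
        std::lock_guard<std::mutex> g(m_);
        if (!pending_.empty()) drain_locked();
    }

private:
    void drain_locked() {
        kernel_(pending_);   // one dispatch processes the aggregated batch
        pending_.clear();
    }

    std::function<void(const std::vector<WorkItem>&)> kernel_;
    std::size_t batch_size_;
    std::vector<WorkItem> pending_;
    std::mutex m_;
};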

Year Published: 2015–2022

Cited by 12 publications (8 citation statements)
References 19 publications
“…In recent years, several attempts have been made to take GPUs' inherently data-parallel execution model and adapt it to target task-parallel programs [1,17,36]. Perhaps most related to our work, Orr et al [29] provide a hardware implementation of the channels model proposed by Gaster and Howes [12] and offer a mapping from simple Cilk-style programs to their channels implementation. Interestingly, the execution model imposed by channels on these programs resembles the level-by-level breadth-first execution strategy of our initial code transformation.…”
Section: Related Work (mentioning)
confidence: 99%
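To make the quoted observation concrete, here is a small, purely hypothetical C++ sketch (not code from [29] or [12]) of a level-by-level, breadth-first schedule for a Cilk-style fib spawn tree: every task at the current level executes before any task it spawned, which is the execution order the statement describes.

#include <cstdio>
#include <vector>

struct Task { int n; };  // one Cilk-style "spawn" of fib(n)

int main() {
    const int n = 10;
    std::vector<Task> frontier = {{n}};
    long fib = 0;  // summing base-case values over the tree yields fib(n)
    while (!frontier.empty()) {
        std::vector<Task> next;               // children spawned by this level
        for (const Task& t : frontier) {      // the whole level runs together
            if (t.n < 2) fib += t.n;          // base case: fib(0)=0, fib(1)=1
            else {
                next.push_back({t.n - 1});    // spawned children execute in
                next.push_back({t.n - 2});    // the next level, not immediately
            }
        }
        frontier.swap(next);                  // advance one tree level
    }
    std::printf("fib(%d) = %ld\n", n, fib);   // prints fib(10) = 55
    return 0;
}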
“…To address this shortcoming, there have been many proposals to map coarse-grained tasks to commodity GPUs [1,36] or to modify GPU hardware to better accommodate recursive parallelism with fine-grained tasks [17,29,33]. In this paper, we consider the problem of effectively mapping fine-grained, recursive, parallel applications to commodity vector units.…”
Section: Introduction (mentioning)
confidence: 99%
“…Steffen et al [34] propose the idea of a dynamic micro-kernel architecture for global rendering algorithms, which supports dynamically spawning threads as a new warp to execute a subsection of the parent thread's code. Orr et al [29] design a task aggregation framework on the GPU based on the channel abstraction proposed by Gaster et al [14]. Each channel is defined as a finite queue in virtual memory (global memory space that is visible to both CPU and GPU) whose elements are dynamically generated tasks that execute the same kernel function.…”
Section: Related Work (mentioning)
confidence: 99%
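A minimal sketch of the queue structure that statement describes, assuming (for brevity) one producer and one consumer; the names are hypothetical, and a real channel would be allocated in CPU/GPU-shared virtual memory and use GPU-side atomics with a reserve/commit protocol to support many concurrent producers.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

struct TaskArgs { std::uint32_t arg0, arg1; };  // argument record for one task

template <std::size_t N>             // capacity fixed up front: a finite queue
struct ChannelQueue {
    TaskArgs slots[N];
    std::atomic<std::uint64_t> head{0};  // next slot to consume
    std::atomic<std::uint64_t> tail{0};  // next slot to produce

    bool try_push(TaskArgs t) {          // producer: a thread spawning a task
        std::uint64_t tl = tail.load();
        if (tl - head.load() >= N) return false;  // queue is full
        slots[tl % N] = t;
        tail.store(tl + 1);
        return true;
    }

    // Consumer: records are drained (in practice, in batches) and every one
    // is executed by the same kernel function associated with this channel.
    std::optional<TaskArgs> try_pop() {
        std::uint64_t hd = head.load();
        if (hd == tail.load()) return std::nullopt;  // queue is empty
        TaskArgs t = slots[hd % N];
        head.store(hd + 1);
        return t;
    }
};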
“…In this work, we improve the performance and energy consumption of GPU-initiated communication using a little-known feature of modern GPUs: embedded, programmable microprocessors that are typically referred to as Command Processors (CPs). These processors exist on the GPU device itself and are utilized to perform the serial tasks involved in launching and tearing down a GPU kernel [4,30]. However, in the presence of intra-kernel networking, programmers are encouraged to use larger (and fewer) kernels, as they no longer need to split kernels at network communication points.…”
Section: Introduction (mentioning)
confidence: 99%
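As a purely illustrative sketch of that division of labor (the types and helpers below are hypothetical stand-ins, not the paper's or any vendor's interface), the CP can be pictured as an event loop that drains serial requests the GPU posts to a shared command slot:

#include <cstdint>
#include <cstdio>

enum class CpCommand : std::uint32_t { None, LaunchKernel, TeardownKernel, NetSend };

struct CommandSlot {
    volatile CpCommand cmd;  // written by GPU threads, polled by the CP
    std::uint64_t payload;   // e.g., a pointer to a descriptor
};

// Stub handlers standing in for the CP's serial duties.
void launch_kernel(std::uint64_t d)   { std::printf("launch   %llu\n", (unsigned long long)d); }
void teardown_kernel(std::uint64_t d) { std::printf("teardown %llu\n", (unsigned long long)d); }
void network_send(std::uint64_t d)    { std::printf("net send %llu\n", (unsigned long long)d); }

// CP event loop: by absorbing serial work, including communication, it lets
// kernels stay large instead of being split at every network point.
// (Volatile polling is only illustrative; real firmware would synchronize.)
void cp_main_loop(CommandSlot* slot) {
    for (;;) {
        switch (slot->cmd) {
            case CpCommand::LaunchKernel:   launch_kernel(slot->payload);   break;
            case CpCommand::TeardownKernel: teardown_kernel(slot->payload); break;
            case CpCommand::NetSend:        network_send(slot->payload);    break;
            case CpCommand::None:           continue;  // nothing posted yet
        }
        slot->cmd = CpCommand::None;  // acknowledge; slot is free again
    }
}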
“…In this paper, we assume a programmable CP implemented as a general-purpose CPU with private L1 instruction and data caches. The CP is hooked up to the GPU through a shared L2 cache, as described in prior art [30].…”
Section: Introduction (mentioning)
confidence: 99%