2014
DOI: 10.1145/2678373.2665701

Fine-grain task aggregation and coordination on GPUs

Abstract: In general-purpose graphics processing unit (GPGPU) computing, data is processed by concurrent threads executing the same function. This model, dubbed single-instruction/multiple-thread (SIMT), requires programmers to coordinate the synchronous execution of similar operations across thousands of data elements. To alleviate this programmer burden, Gaster and Howes outlined the channel abstraction, which facilitates dynamically aggregating asynchronously produced fine-grain work into coarser-grain tasks. Howe…
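The aggregation idea in the abstract is easy to picture in code. Below is a minimal host-side C++ sketch, written only as an illustration of the channel concept; the names (Channel, push, flush) are hypothetical, and a real implementation would place the queue in GPU-visible memory and dispatch each drained batch as a kernel launch rather than calling a std::function under a mutex.

#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

// Illustrative channel: producers asynchronously push fine-grain work items;
// once enough accumulate, the whole batch runs through one function, mirroring
// SIMT execution of the same kernel over many data elements.
template <typename WorkItem>
class Channel {
public:
    Channel(std::function<void(const std::vector<WorkItem>&)> kernel,
            std::size_t batch_size = 256)
        : kernel_(std::move(kernel)), batch_size_(batch_size) {}

    // Any producer thread may push at any time (asynchronous, fine-grain).
    void push(WorkItem w) {
        std::lock_guard<std::mutex> g(m_);
        pending_.push_back(std::move(w));
        if (pending_.size() >= batch_size_) drain_locked();  // coarse-grain task
    }

    // Flush leftovers, e.g. at a phase boundary.
    void flush() {
        std::lock_guard<std::mutex> g(m_);
        if (!pending_.empty()) drain_locked();
    }

private:
    void drain_locked() {
        kernel_(pending_);   // one dispatch processes the aggregated batch
        pending_.clear();
    }

    std::function<void(const std::vector<WorkItem>&)> kernel_;
    std::size_t batch_size_;
    std::vector<WorkItem> pending_;
    std::mutex m_;
};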

Year Published: 2015–2022

Cited by 12 publications (8 citation statements)
References 19 publications
“…In recent years, several attempts have been made to take GPUs' inherently data-parallel execution model and adapt it to target task-parallel programs [1,17,36]. Perhaps most related to our work, Orr et al [29] provide a hardware implementation of the channels model proposed by Gaster and Howes [12] and offer a mapping from simple Cilk-style programs to their channels implementation. Interestingly, the execution model imposed by channels on these programs resembles the level-by-level breadth-first execution strategy of our initial code transformation.…”
Section: Related Work (mentioning)
confidence: 99%
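To make the quoted observation concrete, here is a small, purely hypothetical C++ sketch (not code from [29] or [12]) of a level-by-level, breadth-first schedule for a Cilk-style fib spawn tree: every task at the current level executes before any task it spawned, which is the execution order the statement describes.

#include <cstdio>
#include <vector>

struct Task { int n; };  // one Cilk-style "spawn" of fib(n)

int main() {
    const int n = 10;
    std::vector<Task> frontier = {{n}};
    long fib = 0;  // summing base-case values over the tree yields fib(n)
    while (!frontier.empty()) {
        std::vector<Task> next;               // children spawned by this level
        for (const Task& t : frontier) {      // the whole level runs together
            if (t.n < 2) fib += t.n;          // base case: fib(0)=0, fib(1)=1
            else {
                next.push_back({t.n - 1});    // spawned children execute in
                next.push_back({t.n - 2});    // the next level, not immediately
            }
        }
        frontier.swap(next);                  // advance one tree level
    }
    std::printf("fib(%d) = %ld\n", n, fib);   // prints fib(10) = 55
    return 0;
}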
“…To address this shortcoming, there have been many proposals to map coarse-grained tasks to commodity GPUs [1,36] or to modify GPU hardware to better accommodate recursive parallelism with fine-grained tasks [17,29,33]. In this paper, we consider the problem of effectively mapping fine-grained, recursive, parallel applications to commodity vector units.…”
Section: Introduction (mentioning)
confidence: 99%
“…Steffen et al [34] propose the idea of a dynamic micro-kernel architecture for global rendering algorithms, which supports dynamically spawning threads as a new warp to execute a subsection of the parent thread's code. Orr et al [29] design a task aggregation framework on the GPU based on the channel abstraction proposed by Gaster et al [14]. Each channel is defined as a finite queue in virtual memory (global memory space that is visible to both CPU and GPU) whose elements are dynamically generated tasks that execute the same kernel function.…”
Section: Related Work (mentioning)
confidence: 99%
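A minimal sketch of the queue structure that statement describes, assuming (for brevity) one producer and one consumer; the names are hypothetical, and a real channel would be allocated in CPU/GPU-shared virtual memory and use GPU-side atomics with a reserve/commit protocol to support many concurrent producers.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

struct TaskArgs { std::uint32_t arg0, arg1; };  // argument record for one task

template <std::size_t N>             // capacity fixed up front: a finite queue
struct ChannelQueue {
    TaskArgs slots[N];
    std::atomic<std::uint64_t> head{0};  // next slot to consume
    std::atomic<std::uint64_t> tail{0};  // next slot to produce

    bool try_push(TaskArgs t) {          // producer: a thread spawning a task
        std::uint64_t tl = tail.load();
        if (tl - head.load() >= N) return false;  // queue is full
        slots[tl % N] = t;
        tail.store(tl + 1);
        return true;
    }

    // Consumer: records are drained (in practice, in batches) and every one
    // is executed by the same kernel function associated with this channel.
    std::optional<TaskArgs> try_pop() {
        std::uint64_t hd = head.load();
        if (hd == tail.load()) return std::nullopt;  // queue is empty
        TaskArgs t = slots[hd % N];
        head.store(hd + 1);
        return t;
    }
};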
“…In this work, we improve the performance and energy consumption of GPU-initiated communication using a little-known feature of modern GPUs: embedded, programmable microprocessors that are typically referred to as Command Processors (CPs). These processors exist on the GPU device itself and are utilized to perform the serial tasks involved in launching and tearing down a GPU kernel [4,30]. However, in the presence of intra-kernel networking, programmers are encouraged to use larger (and fewer) kernels, as they no longer need to split kernels at network communication points.…”
Section: Introduction (mentioning)
confidence: 99%
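As a purely illustrative sketch of that division of labor (the types and helpers below are hypothetical stand-ins, not the paper's or any vendor's interface), the CP can be pictured as an event loop that drains serial requests the GPU posts to a shared command slot:

#include <cstdint>
#include <cstdio>

enum class CpCommand : std::uint32_t { None, LaunchKernel, TeardownKernel, NetSend };

struct CommandSlot {
    volatile CpCommand cmd;  // written by GPU threads, polled by the CP
    std::uint64_t payload;   // e.g., a pointer to a descriptor
};

// Stub handlers standing in for the CP's serial duties.
void launch_kernel(std::uint64_t d)   { std::printf("launch   %llu\n", (unsigned long long)d); }
void teardown_kernel(std::uint64_t d) { std::printf("teardown %llu\n", (unsigned long long)d); }
void network_send(std::uint64_t d)    { std::printf("net send %llu\n", (unsigned long long)d); }

// CP event loop: by absorbing serial work, including communication, it lets
// kernels stay large instead of being split at every network point.
// (Volatile polling is only illustrative; real firmware would synchronize.)
void cp_main_loop(CommandSlot* slot) {
    for (;;) {
        switch (slot->cmd) {
            case CpCommand::LaunchKernel:   launch_kernel(slot->payload);   break;
            case CpCommand::TeardownKernel: teardown_kernel(slot->payload); break;
            case CpCommand::NetSend:        network_send(slot->payload);    break;
            case CpCommand::None:           continue;  // nothing posted yet
        }
        slot->cmd = CpCommand::None;  // acknowledge; slot is free again
    }
}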
“…In this paper, we assume a programmable CP implemented as a general-purpose CPU with private L1 instruction and data caches. The CP is hooked up to the GPU through a shared L2 cache, as described in prior art [30].…”
Section: Introduction (mentioning)
confidence: 99%