Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement

Khorasani, Farzad; Rowe, Bryan; Gupta, Rajiv; Bhuyan, Laxmi N.

doi:10.1109/ipdps.2016.36

Cited by 12 publications

(7 citation statements)

References 26 publications


“…This memory segment is used to store the configurations and arguments that threads in the group pass to their children. Next, each thread that launches a child grid (has a non-zero grid dimension) atomically increments two global counters simultaneously (lines 19-20): (1) _numParents to assign an index to the parent thread so that the thread knows where to store its arguments and configuration, and (2) _sumGDim to find the total number of child blocks of prior parent threads which we use to initialize the scanned array of grid dimensions. The two global counters are incremented simultaneously by treating them as a single 64-bit integer.…”
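The packed-counter trick in the statement above can be sketched on the CPU as follows. This is a minimal illustration, not the citing paper's code: the bit layout (`_sumGDim` in the high 32 bits, `_numParents` in the low 32 bits), the class name `PackedCounters`, and the method `fetch_add` are assumptions made for the example. On a GPU the update would be a single 64-bit atomic add (CUDA provides `atomicAdd` on `unsigned long long`); here a plain integer stands in for it, so the sketch shows the arithmetic only, not the atomicity.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def pack(num_parents, sum_gdim):
    """Pack two 32-bit counters into one 64-bit word (layout is an assumption)."""
    return ((sum_gdim & MASK32) << 32) | (num_parents & MASK32)


def unpack(word):
    """Return (num_parents, sum_gdim) from a packed 64-bit word."""
    return word & MASK32, (word >> 32) & MASK32


class PackedCounters:
    """Hypothetical stand-in for the two global counters stored as one word."""

    def __init__(self):
        self.word = 0

    def fetch_add(self, grid_dim):
        # One addition bumps both halves: +1 parent, +grid_dim child blocks.
        # The returned old value gives this parent its index and the number
        # of child blocks launched by parents that came before it.
        old = self.word
        self.word = (self.word + pack(1, grid_dim)) & MASK64
        return unpack(old)


counters = PackedCounters()
for grid_dim in [3, 0, 5, 2]:
    if grid_dim == 0:
        continue  # threads that launch no child grid do not participate
    index, blocks_before = counters.fetch_add(grid_dim)
    print(index, blocks_before)
```

The sketch assumes neither counter overflows 32 bits; an overflow of the low half would carry into the high half and corrupt it.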

Section: A Multi-block Granularity Aggregation (mentioning)

confidence: 99%

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

Olabi, Gómez-Luna, Mutlu et al. 2022

Preprint


Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small grids are launched. The large number of launches results in high launch latency due to congestion, and the small grid sizes result in hardware underutilization.

To address this issue, we propose a compiler framework for optimizing the use of dynamic parallelism in applications with nested parallelism. The framework features three key optimizations: thresholding, coarsening, and aggregation. Thresholding involves launching a grid dynamically only if the number of child threads exceeds some threshold, and serializing the child threads in the parent thread otherwise. Coarsening involves executing the work of multiple thread blocks by a single coarsened block to amortize the common work across them. Aggregation involves combining multiple child grids into a single aggregated grid.

Thresholding is sometimes applied manually by programmers in the context of dynamic parallelism. We automate it in the compiler and discuss the challenges associated with doing so. Coarsening is sometimes applied as an optimization in other contexts. We propose to apply coarsening in the context of dynamic parallelism and automate it in the compiler as well. Aggregation has been automated in the compiler by prior work. We enhance aggregation by proposing a new aggregation technique that uses multi-block granularity. We also integrate these three optimizations into an open-source compiler framework to simplify the process of optimizing dynamic parallelism code.

Our evaluation shows that our compiler framework improves the performance of applications with nested parallelism by a geometric mean of 43.0× over applications that use dynamic parallelism, 8.7× over applications that do not use dynamic parallelism, and 3.6× over applications that use dynamic parallelism with aggregation alone as proposed in prior work.
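The thresholding and aggregation decisions described in the abstract can be sketched as a small planning function. This is a CPU-side illustration under stated assumptions, not the framework's generated code: the function name `plan_launches`, the threshold value, and the choice to merge all qualifying child grids into exactly one launch are hypothetical. Coarsening is left out of the sketch.

```python
def plan_launches(child_counts, threshold):
    """Decide, per parent thread, whether its child work is serialized or launched.

    child_counts[i] is the number of child threads parent i would launch.
    Returns (serialized, aggregated, launches_plain, launches_optimized).
    """
    serialized = []  # parents that run their child work inline (thresholding)
    aggregated = []  # parents whose child grids are merged into one grid
    for parent, n_children in enumerate(child_counts):
        if n_children == 0:
            continue
        if n_children > threshold:
            aggregated.append(parent)
        else:
            serialized.append(parent)
    # Without the optimizations, every parent with work launches its own grid.
    launches_plain = sum(1 for n in child_counts if n > 0)
    # With aggregation, all qualifying child grids share a single launch.
    launches_optimized = 1 if aggregated else 0
    return serialized, aggregated, launches_plain, launches_optimized


print(plan_launches([2, 500, 0, 7, 1024], threshold=128))
```

In this toy input, four small-to-large launches collapse into one aggregated launch plus two parents that do their work serially, which is the effect the abstract attributes to combining the two optimizations.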



“…The main idea is to gather enough iterations that enter the same direction of the if‐else statement and execute them at once. CTE [19] tried to gather fine‐grained tasks from multiple coarse‐grained tasks using a prefix‐sum and binary search technique similar to Algorithm 1. However, both CCC and CTE only handle load imbalance inside a warp and cannot distribute workload to different warps or TBs.…”
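The prefix-sum and binary search technique mentioned in the statement above can be sketched generically as follows. This is not the CTE implementation and not the citing paper's Algorithm 1 (which is not reproduced on this page); the function name `assign_fine_tasks` and the lane count are assumptions for the example. Each coarse-grained task owns some number of fine-grained tasks, an exclusive prefix sum turns those counts into offsets, and each lane maps its global fine-task index back to a coarse task with a binary search over the offsets.

```python
from bisect import bisect_right
from itertools import accumulate


def assign_fine_tasks(fine_counts, num_lanes=32):
    """Spread fine-grained tasks evenly over the lanes of one warp.

    fine_counts[i] is the number of fine-grained tasks in coarse task i.
    Returns a list of rounds; each round is a list of
    (lane, coarse_task, index_within_coarse_task) tuples.
    """
    # Exclusive prefix sum with the grand total appended at the end.
    offsets = [0] + list(accumulate(fine_counts))
    total = offsets[-1]
    schedule = []
    for base in range(0, total, num_lanes):
        this_round = []
        for lane in range(num_lanes):
            global_index = base + lane
            if global_index >= total:
                break  # trailing lanes idle only in the final round
            # Binary search: last coarse task whose offset is <= global_index.
            coarse = bisect_right(offsets, global_index) - 1
            this_round.append((lane, coarse, global_index - offsets[coarse]))
        schedule.append(this_round)
    return schedule


for this_round in assign_fine_tasks([3, 0, 2], num_lanes=4):
    print(this_round)
```

With one lane per coarse task, the lane holding the largest task would dictate the running time while the others idle; with this mapping, every lane stays busy in all rounds except possibly the last. As the statement notes, the balancing stays within one warp.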

Section: Related Work (mentioning)

confidence: 99%

“…In summary, all the related work lacks some advantages of PFACC, including reasonable memory usage (lacked by NESLGPU [13], Nessie [14], and CuNesl [15]), taking the thread hierarchy into account (lacked by NESLGPU [13], Nessie [14], and PiecewiseNDP [12]), unlimited nesting depth (lacked by CDP, OpenACC, CopperHead [16], and Hidp [17]), and the ability to distribute workload across all threads in a kernel (lacked by CCC [18] and CTE [19]).…”

Section: Related Work (mentioning)

confidence: 99%

PFACC: An OpenACC‐like programming model for irregular nested parallelism

Huang, Yang 2020

Softw Pract Exp


Summary: OpenACC is a directive‐based programming model which allows programmers to write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration‐sharing and memory allocation routines. The PFACC runtime iteration‐sharing routine is a two‐level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth‐first order. Different thread blocks share iterations among one another with an iteration‐stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth‐first execution order. The two‐level iteration‐sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.
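The link the abstract draws between depth-first execution and memory usage can be illustrated with a toy model. This is not PFACC's runtime: it uses a single worker, tracks individual iterations rather than batches, has no iteration stealing, and assumes a perfectly regular loop nest. The function name `peak_pending` and its parameters are hypothetical. It counts the largest number of loop iterations that are waiting at once when a loop nest is expanded depth first versus breadth first.

```python
from collections import deque


def peak_pending(fanout, depth, depth_first):
    """Peak number of pending iterations for a regular loop nest.

    Every iteration at nesting level < depth spawns `fanout` inner iterations.
    depth_first=True expands the newest pending iteration first (a stack);
    depth_first=False expands the oldest first (a queue).
    """
    pending = deque([0])  # each entry is the nesting level of one iteration
    peak = 1
    while pending:
        level = pending.pop() if depth_first else pending.popleft()
        if level < depth:
            pending.extend([level + 1] * fanout)
        peak = max(peak, len(pending))
    return peak


print(peak_pending(fanout=4, depth=5, depth_first=True))
print(peak_pending(fanout=4, depth=5, depth_first=False))
```

In this model the depth-first peak grows linearly with nesting depth, `depth * (fanout - 1) + 1`, while the breadth-first peak grows as `fanout ** depth`, which is one way to read the abstract's claim about reasonable memory usage.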


“…Each of these two warps have partially active lanes, and the warps have to be executed one after another. On completion of the execution of both paths, the warps rejoin to continue normal execution as a single warp [52,53].…”
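The cost of the serialized execution described in the statement above can be estimated with a small model. It is a sketch under simplifying assumptions: both branch paths are taken to cost one pass of equal length, the warp has at least one lane, and the function names `divergent_passes` and `lane_utilization` are hypothetical.

```python
def divergent_passes(predicates):
    """Active-lane count for each serialized pass of a two-way branch.

    predicates[i] is True if lane i takes the 'if' path, False for 'else'.
    A path with no lanes on it is skipped entirely.
    """
    taken = sum(predicates)
    not_taken = len(predicates) - taken
    return [count for count in (taken, not_taken) if count > 0]


def lane_utilization(predicates):
    """Fraction of lane slots doing useful work across all passes."""
    passes = divergent_passes(predicates)
    return sum(passes) / (len(passes) * len(predicates))


warp = [lane % 4 == 0 for lane in range(32)]  # 8 lanes take the 'if' path
print(divergent_passes(warp), lane_utilization(warp))
```

Whenever both paths are taken, each lane is active in exactly one of the two passes, so utilization in this model drops to one half regardless of how the lanes are split.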

Section: Partial-lane (mentioning)

confidence: 99%

ITAP

Sadrosadati, Ehsani, Falahati et al. 2019

ACM Trans. Archit. Code Optim.


Graphics Processing Units (GPUs) are widely used as the accelerator of choice for applications with massively data-parallel tasks. However, recent studies show that GPUs suffer heavily from resource underutilization, which, combined with their large static power consumption, imposes a significant power overhead. One of the most power-hungry components of a GPU, the execution units, frequently experiences idleness when (1) an underutilized warp is issued to the execution units, leading to partial lane idleness, and (2) there is no active warp to be issued for execution due to warp stalls (e.g., waiting for memory access and synchronization). Although large in total, the idle time of execution units actually comes from short but frequent stalls, leaving little potential for common power saving techniques, such as power-gating. In this article, we propose ITAP, a novel idle-time-aware power management technique, which aims to effectively reduce the static energy consumption of GPU execution units. By taking advantage of different power management techniques (i.e., power-gating and different levels of voltage scaling), ITAP employs three static power reduction modes with different overheads and capabilities of static power reduction. ITAP estimates the idle period length of execution units using prediction and peek-ahead techniques in a synergistic way and then applies the most appropriate static power reduction mode based on the estimated idle period.

M. Sadrosadati performed part of this work at ETH Zürich. L. Orosa was supported by FAPESP fellowship 2016/18929-4.

