Proceedings of the 48th International Symposium on Microarchitecture 2015
DOI: 10.1145/2830772.2830796
Efficient warp execution in presence of divergence with collaborative context collection

Abstract: The GPU's SIMD architecture is a double-edged sword for parallel tasks with control-flow divergence. On the one hand, it provides a high-performance yet power-efficient platform to accelerate applications via massive parallelism; on the other hand, irregularities induce inefficiencies because the warp traverses all diverging execution paths in lockstep. In this work, we present a software (compiler) technique named Collaborative Context Collection (CCC) that increases the warp execution efficiency…
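
The inefficiency the abstract describes is easy to reproduce. The following standalone CUDA sketch is ours, not from the paper; the kernel name and the even/odd predicate are purely illustrative. Lanes of one warp disagree on the branch, so the hardware serializes both paths with partial lane masks:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergent_kernel(const int *in, int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    // Lanes of one warp disagree on this predicate, so the hardware walks
    // BOTH paths in lockstep, masking off the lanes that did not take each one.
    if (in[tid] % 2 == 0) {
        out[tid] = in[tid] * 2;   // runs with the odd lanes disabled
    } else {
        out[tid] = in[tid] + 1;   // runs with the even lanes disabled
    }
}

int main() {
    const int n = 64;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    divergent_kernel<<<1, n>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[2]=%d out[3]=%d\n", h_out[2], h_out[3]);  // prints 4 and 4
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

On such a warp, each side of the branch occupies the full issue slot while only half the lanes do useful work; CCC targets exactly this loss.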

Cited by 30 publications (14 citation statements).
References 53 publications (56 reference statements).
“…CCC [18] tried to increase warp execution efficiency when each thread processes repetitive tasks with divergent paths, such as a loop containing if-else statements in its body. The main idea is to gather enough iterations that enter the same direction of the if-else statement and then execute them at once.…”
Section: Related Work
confidence: 99%
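
To make the quoted idea concrete, here is a hedged CUDA sketch of that collection scheme, written under simplifying assumptions (one warp per block; the names taskPool, poolCount, process_common, process_rare and the mod-7 predicate are ours, not CCC's compiler-generated code). Rare-direction iterations are deferred into a warp-level pool and executed together once a full warp's worth has accumulated:

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

__device__ int g_sink;  // illustrative sink so the compiler keeps the work

__device__ void process_common(int task) { atomicAdd(&g_sink, task); }
__device__ void process_rare(int task)   { atomicAdd(&g_sink, task * 3); }

__global__ void ccc_style_loop(const int *tasks, int numTasks) {
    // Per-warp pool of deferred contexts (one warp per block for simplicity).
    __shared__ int taskPool[2 * WARP_SIZE];
    __shared__ int poolCount;
    int lane = threadIdx.x % WARP_SIZE;
    if (lane == 0) poolCount = 0;
    __syncwarp();

    int rounds = (numTasks + WARP_SIZE - 1) / WARP_SIZE;
    for (int r = 0; r < rounds; ++r) {
        int i = r * WARP_SIZE + lane;
        int task = (i < numTasks) ? tasks[i] : 0;
        bool rare = (i < numTasks) && (task % 7 == 0);  // illustrative predicate
        if ((i < numTasks) && !rare)
            process_common(task);  // common direction: runs mostly converged
        if (rare)                  // rare direction: defer instead of diverging
            taskPool[atomicAdd(&poolCount, 1)] = task;
        __syncwarp();
        if (poolCount >= WARP_SIZE) {
            // Enough same-direction iterations gathered: all 32 lanes now
            // execute the rare direction at once, fully converged.
            process_rare(taskPool[poolCount - WARP_SIZE + lane]);
            __syncwarp();
            if (lane == 0) poolCount -= WARP_SIZE;
        }
        __syncwarp();
    }
    // A final, divergent pass over the < 32 leftover contexts is omitted.
}

int main() {
    const int n = 256;
    int h_tasks[n];
    for (int i = 0; i < n; ++i) h_tasks[i] = i;
    int *d_tasks;
    cudaMalloc(&d_tasks, n * sizeof(int));
    cudaMemcpy(d_tasks, h_tasks, n * sizeof(int), cudaMemcpyHostToDevice);
    ccc_style_loop<<<1, WARP_SIZE>>>(d_tasks, n);
    cudaDeviceSynchronize();
    cudaFree(d_tasks);
    return 0;
}
```

The real CCC transformation also handles multiple warps per block and the leftover contexts; the sketch keeps only the defer-then-batch structure the citing paper summarizes.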
“…Each of these two warps has partially active lanes, and the warps have to be executed one after the other. On completion of the execution of both paths, the warps rejoin and continue normal execution as a single warp [52,53].…”
Section: Partial-lane
confidence: 99%
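
The partial-lane serialization and the rejoin described above can be observed directly with the __activemask() intrinsic (CUDA 9+). This single-warp sketch is ours and merely illustrative; on Volta and later, independent thread scheduling means the exact masks are not guaranteed, but the half-warp split below is the typical outcome:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void show_partial_lanes() {
    int lane = threadIdx.x % 32;
    if (lane < 16) {
        unsigned m = __activemask();   // typically 0x0000ffff: lanes 0-15 only
        if (lane == 0) printf("then-path mask: 0x%08x\n", m);
    } else {
        unsigned m = __activemask();   // typically 0xffff0000: lanes 16-31 only
        if (lane == 16) printf("else-path mask: 0x%08x\n", m);
    }
    __syncwarp();                      // reconvergence: the warp is whole again
    unsigned m = __activemask();       // typically 0xffffffff
    if (lane == 0) printf("after rejoin:   0x%08x\n", m);
}

int main() {
    show_partial_lanes<<<1, 32>>>();   // launch exactly one warp
    cudaDeviceSynchronize();
    return 0;
}
```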
“…For applications that seldom use shared memory, the shared memory can also be used to store temporary context information. For example, to compact divergent threads, the relevant registers of the divergent threads can be collected in a warp-specific stack allocated in shared memory and restored only when perfect utilization of the warp lanes becomes feasible [15]. To maximize thread parallelism by assigning threads up to the register-file limit instead of the scheduling limit [37], the context information of thread blocks that are currently not considered for scheduling can be stored temporarily in shared memory.…”
Section: Using Unused Shared Memory To Store Context Information
confidence: 99%
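
A hedged sketch of such a warp-specific context stack follows, assuming one warp per block; the Context struct, its fields, and the v > 0.5f predicate are illustrative stand-ins for whatever live registers the divergent path actually needs, not the layout used in [15]. Unlike the earlier pool of plain task IDs, this variant spills and restores a multi-register context per deferred thread:

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

struct Context {   // stand-in for the "relevant registers" of one deferred thread
    int   idx;     // e.g. a live loop counter
    float acc;     // e.g. a live partial result
};

__global__ void context_stack_demo(const float *data, float *out, int n) {
    // Warp-specific stack in shared memory (one warp per block for simplicity).
    __shared__ Context ctxStack[2 * WARP_SIZE];
    __shared__ int     ctxTop;
    int lane = threadIdx.x % WARP_SIZE;
    if (lane == 0) ctxTop = 0;
    __syncwarp();

    int rounds = (n + WARP_SIZE - 1) / WARP_SIZE;
    for (int r = 0; r < rounds; ++r) {
        int i = r * WARP_SIZE + lane;
        float v = (i < n) ? data[i] : 0.0f;
        if (i < n && v > 0.5f) {
            // Divergent case: spill this lane's live registers to the stack
            // instead of running the expensive path with a partial warp.
            int slot = atomicAdd(&ctxTop, 1);
            ctxStack[slot].idx = i;
            ctxStack[slot].acc = v;
        }
        __syncwarp();
        if (ctxTop >= WARP_SIZE) {
            // Perfect lane utilization is now feasible: every lane restores
            // one saved context and the expensive path runs fully converged.
            Context c = ctxStack[ctxTop - WARP_SIZE + lane];
            out[c.idx] = c.acc * 2.0f;   // illustrative expensive path
            __syncwarp();
            if (lane == 0) ctxTop -= WARP_SIZE;
        }
        __syncwarp();
    }
    // Leftover (< 32) contexts still need one final divergent drain (omitted).
}

int main() {
    const int n = 256;
    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = (i % 10) / 10.0f;
    float *d_data, *d_out;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    context_stack_demo<<<1, WARP_SIZE>>>(d_data, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}
```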