2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016
DOI: 10.1109/ipdps.2016.36

Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement

Cited by 12 publications (7 citation statements)
References 26 publications
“…This memory segment is used to store the configurations and arguments that threads in the group pass to their children. Next, each thread that launches a child grid (has a non-zero grid dimension) atomically increments two global counters simultaneously (lines 19-20): (1) _numParents to assign an index to the parent thread so that the thread knows where to store its arguments and configuration, and (2) _sumGDim to find the total number of child blocks of prior parent threads which we use to initialize the scanned array of grid dimensions. The two global counters are incremented simultaneously by treating them as a single 64-bit integer.…”
Section: A Multi-block Granularity Aggregation (mentioning)
confidence: 99%
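
The packed-counter idea in the statement above can be illustrated with a minimal sketch. This is not the citing paper's code: the class `PackedCounters`, its methods, and the choice of 32 bits per counter are assumptions made for illustration, and a lock stands in for the single 64-bit atomic add a GPU would issue. The sketch also assumes the running block total always fits in the low 32 bits.

```python
import threading

LOW_BITS = 32                      # assumed width of each packed counter
LOW_MASK = (1 << LOW_BITS) - 1
WORD_MASK = (1 << 64) - 1


class PackedCounters:
    """Two counters in one 64-bit word (hypothetical helper).

    High half: number of parents seen so far (like _numParents).
    Low half:  total child blocks requested so far (like _sumGDim).
    """

    def __init__(self):
        self._word = 0
        # The lock only emulates atomicity; on a GPU this would be one
        # 64-bit atomic add on the packed word.
        self._lock = threading.Lock()

    def fetch_add(self, grid_dim):
        """Add 1 to the high half and grid_dim to the low half in one step.

        Returns the values *before* the add: (parent_index, blocks_before).
        """
        if not 0 < grid_dim <= LOW_MASK:
            raise ValueError("grid_dim must be a non-zero 32-bit value")
        increment = (1 << LOW_BITS) | grid_dim
        with self._lock:
            old = self._word
            self._word = (old + increment) & WORD_MASK
        return old >> LOW_BITS, old & LOW_MASK

    def totals(self):
        """Current (num_parents, sum_grid_dim)."""
        with self._lock:
            word = self._word
        return word >> LOW_BITS, word & LOW_MASK


# Only threads with a non-zero grid dimension take part.
counters = PackedCounters()
requested = [3, 0, 5, 2]
slots = [counters.fetch_add(g) for g in requested if g != 0]
# slots == [(0, 0), (1, 3), (2, 8)]
```

Because both halves change in one add, each parent receives an index and a block offset that are consistent with each other, which is what lets the offsets seed a scanned array of grid dimensions.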
“…The main idea is to gather enough iterations that enters the same direction of the if-else statement and execute them at once. CTE [19] tried to gather fine-grained tasks from multiple coarse-grained tasks using a prefix-sum and binary search technique similar to Algorithm 1. However, both CCC and CTE only handle load imbalance inside a warp and can not distribute workload to different warps or TBs.…”
Section: Related Work (mentioning)
confidence: 99%
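
The prefix-sum and binary-search technique named in this statement can be sketched as follows. This is a generic illustration, not the CTE implementation or the citing paper's Algorithm 1: the function names and the round-robin assignment of flat task ids to lanes are assumptions.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(inner_sizes):
    """Exclusive prefix sum over the inner-loop sizes.

    offsets[i] is the flat id of the first fine-grained task of outer
    task i; offsets[-1] is the total number of fine-grained tasks.
    """
    return [0] + list(accumulate(inner_sizes))


def locate(offsets, flat_id):
    """Binary-search the scanned array for the owner of flat_id.

    Returns (outer_index, local_index). Empty outer tasks are skipped
    because bisect_right lands past equal offsets.
    """
    if not 0 <= flat_id < offsets[-1]:
        raise IndexError("flat_id out of range")
    outer = bisect_right(offsets, flat_id) - 1
    return outer, flat_id - offsets[outer]


def assign_to_lanes(inner_sizes, num_lanes=32):
    """Spread all fine-grained tasks evenly over the lanes of one warp.

    Round-robin assignment is an assumption made for this sketch.
    """
    offsets = build_offsets(inner_sizes)
    lanes = [[] for _ in range(num_lanes)]
    for flat_id in range(offsets[-1]):
        lanes[flat_id % num_lanes].append(locate(offsets, flat_id))
    return lanes


# Three outer tasks with 2, 0, and 3 inner iterations on a 4-lane "warp".
lanes = assign_to_lanes([2, 0, 3], num_lanes=4)
# lanes == [[(0, 0), (2, 2)], [(0, 1)], [(2, 0)], [(2, 1)]]
```

No lane holds more than one task beyond any other, regardless of how unevenly the inner sizes are distributed. As the statement notes, this balancing stays within a single warp.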
“…In summary, all the related work lack some advantages of PFACC, including reasonable memory usage (lack by NESLGPU [13], Nessie [14], and CuNesl [15]), take thread hierarchy into account (lack by NESLGPU [13], Nessie [14], and PiecewiseNDP [12]), unlimited nesting depth (lack by CDP, OpenACC, CopperHead [16], and Hidp [17]), and able to distribute workload around all threads in a kernel (lack by CCC [18] and CTE [19]).…”
Section: Related Work (mentioning)
confidence: 99%
“…Each of these two warps have partially active lanes, and the warps have to be executed one after another. On completion of the execution of both paths, the warps rejoin to continue normal execution as a single warp [52,53].…”
Section: Partial-lane (mentioning)
confidence: 99%
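
The serialization described in this statement can be modeled with a small sketch. It is a simplified illustration rather than a description of any particular GPU: the function names are hypothetical, and the utilization figure assumes both branch paths take the same time.

```python
def split_warp(predicates):
    """Partition lane ids by the branch each lane takes."""
    taken = [lane for lane, p in enumerate(predicates) if p]
    not_taken = [lane for lane, p in enumerate(predicates) if not p]
    return taken, not_taken


def serialized_passes(predicates):
    """Number of passes the warp needs: one per non-empty branch path."""
    return sum(1 for group in split_warp(predicates) if group)


def lane_utilization(predicates):
    """Fraction of lane-slots doing useful work, assuming equal-cost paths.

    Each lane is active in exactly one pass, so this is 1 / passes.
    """
    return 1 / serialized_passes(predicates)


# A 32-lane warp where every fourth lane takes the 'if' side.
predicates = [lane % 4 == 0 for lane in range(32)]
taken, not_taken = split_warp(predicates)
# len(taken) == 8, len(not_taken) == 24, and the warp needs 2 passes.
```

When all lanes agree on the branch, a single pass suffices and no lanes sit idle.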