2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) 2018
DOI: 10.1109/hpca.2018.00040

Warp Scheduling for Fine-Grained Synchronization

Cited by 24 publications (18 citation statements)
References 27 publications
“…Note that Union-Async and Union-JTB are lock-free compare-and-swap (CAS) implementations, whereas Union-Rem-Lock is a lock-based implementation. Spin-locks are used in Union-Rem-Lock, which can significantly degrade parallelism on GPUs [32], so we also implemented a lock-free version using CAS (Union-Rem-CAS).…”
Section: Finish Algorithms
Type: mentioning
confidence: 99%
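
To make the lock-free versus lock-based distinction in the statement above concrete, here is a minimal CPU-side C++ sketch of a CAS-based union operation. It is an illustration under stated assumptions, not a reproduction of Union-Async, Union-JTB, or Union-Rem-CAS: the type `LockFreeUnionFind`, the link-larger-index-under-smaller rule, and the helper `demo_component_count` are all hypothetical, and a GPU version would use device atomics rather than `std::atomic`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical illustration type; not taken from the cited work.
struct LockFreeUnionFind {
    std::vector<std::atomic<std::size_t>> parent;

    explicit LockFreeUnionFind(std::size_t n) : parent(n) {
        for (std::size_t i = 0; i < n; ++i) parent[i].store(i);
    }

    // Follow parent links to the root (no path compression, for simplicity).
    std::size_t find(std::size_t x) const {
        std::size_t p = parent[x].load();
        while (p != x) { x = p; p = parent[x].load(); }
        return x;
    }

    // Link the larger-index root under the smaller-index root with one CAS.
    // No thread ever holds a lock: a failed CAS means another thread already
    // re-linked that root, so we recompute both roots and try again.
    void unite(std::size_t a, std::size_t b) {
        assert(a < parent.size() && b < parent.size());
        for (;;) {
            std::size_t ra = find(a), rb = find(b);
            if (ra == rb) return;              // already in the same set
            if (ra < rb) std::swap(ra, rb);    // ra is now the larger index
            std::size_t expected = ra;         // succeeds only if ra is still a root
            if (parent[ra].compare_exchange_strong(expected, rb)) return;
        }
    }
};

// Four threads concurrently unite the edges of a 1000-node chain,
// then the number of remaining roots (connected components) is counted.
std::size_t demo_component_count() {
    const std::size_t n = 1000;
    LockFreeUnionFind uf(n);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < 4; ++t)
        workers.emplace_back([&uf, t, n] {
            for (std::size_t i = t; i + 1 < n; i += 4) uf.unite(i, i + 1);
        });
    for (auto& w : workers) w.join();
    std::size_t roots = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (uf.find(i) == i) ++roots;
    return roots;
}
```

Because every parent link points to a strictly smaller index, a concurrent link can never create a cycle, which is why a single CAS per union attempt is enough in this sketch.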
“…Yilmazer, et al [46] propose a hardware-accelerated fine-grained lock scheme for GPUs, which adds support for queuing locks in L1 and L2 caches and uses a customized communication protocol to enable faster lock transfer and to reduce lock retries for non-coherent caches. ElTantawy, et al [13] propose a hardware warp scheduling policy that reduces lock retries by de-prioritizing warps whose threads are spin waiting. In addition, hardware accelerated locks have also been proposed for CPUs [4,25,42,47].…”
Section: GPU Solutions
Type: mentioning
confidence: 99%
“…To evaluate our solution, we use three state-of-art GPU implementations of irregular algorithms, which have been shown to compare favorably against CPU implementations [18,22,33], and we use two microbenchmarks, which have been used in previous work on fine-grained locking [12,13,46] and transactional memory [10,15,16,37,45] on GPUs. The two microbenchmarks represent commonly used lock patterns for workloads that manipulate irregular data structures, such as graphs and trees.…”
Section: Benchmarks and Inputs
Type: mentioning
confidence: 99%
“…The restart has a similar effect to backoff locking [36], where a spinlocking thread does meaningless work to temporarily relieve contention over the atomic unit; this is useful when DRAM operations are not slow and atomic operations are fast so that the backoff window is small. ElTantawy and Aamodt [10] showed that an adaptive backoff improves the performance even further, since small backoff delay may increase spinning overheads while a large backoff delay may throttle warps more than necessary. From our experiments we find that spinlocks on high-contention nodes (specifically, full and leaf nodes during insertions) reduce the amount of resident warps that can make progress.…”
Section: Restarts Instead Of Spinlocks
Type: mentioning
confidence: 99%
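
The backoff idea in the statement above can be sketched on a CPU as a spin lock that waits progressively longer after each failed acquisition attempt. This is a plain exponential backoff with a fixed cap, not the adaptive scheme attributed to ElTantawy and Aamodt; the class `BackoffSpinLock`, the helper `demo_backoff_counter`, and the 1 to 64 microsecond delay bounds are assumptions chosen only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical illustration type; names and delay bounds are assumptions.
class BackoffSpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        int delay_us = 1;              // initial backoff window (assumed)
        const int max_delay_us = 64;   // cap on the backoff window (assumed)
        // exchange() returns the previous value: true means the lock was held.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Back off instead of hammering the atomic, doubling up to the cap.
            std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
            if (delay_us < max_delay_us) delay_us *= 2;
        }
    }
    void unlock() {
        assert(locked.load());         // unlocking a free lock is a usage error
        locked.store(false, std::memory_order_release);
    }
};

// Four threads each increment a shared counter 1000 times under the lock.
long demo_backoff_counter() {
    BackoffSpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&lock, &counter] {
            for (int i = 0; i < 1000; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& w : workers) w.join();
    return counter;
}
```

The cap reflects the trade-off the statement describes: a window that is too small leaves most of the spinning overhead in place, while one that is too large keeps waiters idle after the lock has already been released.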