Warp Scheduling for Fine-Grained Synchronization

ElTantawy, Ahmed; Aamodt, Tor M.

doi:10.1109/hpca.2018.00040

Cited by 24 publications

(18 citation statements)

References 27 publications

“…Note that Union-Async and Union-JTB are lock-free compare-and-swap (CAS) implementations, whereas Union-Rem-Lock is a lock-based implementation. Spin-locks are used in Union-Rem-Lock, which can significantly degrade parallelism on GPUs [32], so we also implemented a lock-free version using CAS (Union-Rem-CAS).…”

Section: Finish Algorithms (mentioning)

confidence: 99%

Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUs

Hong

Dhulipala

Shun

2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Connected components and spanning forest are fundamental graph algorithms due to their use in many important applications, such as graph clustering and image segmentation. GPUs are an ideal platform for graph algorithms due to their high peak performance and memory bandwidth. While there exist several GPU connectivity algorithms in the literature, many design choices have not yet been explored. In this paper, we explore various design choices in GPU connectivity algorithms, including sampling, linking, and tree compression, for both the static and the incremental setting. Our various design choices lead to over 300 new GPU implementations of connectivity, many of which outperform the state of the art. We present an experimental evaluation, and show that we achieve an average speedup of 2.47x over existing static algorithms. In the incremental setting, we achieve a throughput of up to 48.23 billion edges per second. Compared to state-of-the-art CPU implementations on a 72-core machine, we achieve a speedup of 8.26-14.51x for static connectivity and 1.85-13.36x for incremental connectivity using a Tesla V100 GPU.
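To make the contrast in the citation statement concrete, here is a minimal CPU-side sketch of a lock-free union step built on compare-and-swap, written with C++ `std::atomic` rather than GPU atomics. The `UnionFind` type, its method names, and the rule of linking the larger root id under the smaller one are assumptions made for illustration; this is not the Union-Rem-CAS code from the cited paper.

```cpp
#include <atomic>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical illustration only: names and linking rule are assumptions,
// not taken from the cited paper.
struct UnionFind {
    std::vector<std::atomic<uint32_t>> parent;

    explicit UnionFind(uint32_t n) : parent(n) {
        // Every vertex starts as the root of its own tree.
        for (uint32_t i = 0; i < n; ++i) parent[i].store(i);
    }

    // Follow parent pointers to the root (no path compression, for brevity).
    uint32_t find(uint32_t x) const {
        uint32_t p = parent[x].load();
        while (p != x) {
            x = p;
            p = parent[x].load();
        }
        return x;
    }

    // Link the larger root under the smaller one with a single CAS.
    // If another thread re-parented the root first, the CAS fails and we
    // retry from fresh roots instead of spinning on a lock.
    void unite(uint32_t a, uint32_t b) {
        for (;;) {
            a = find(a);
            b = find(b);
            if (a == b) return;              // already in the same component
            if (a < b) std::swap(a, b);      // a is now the larger root id
            uint32_t expected = a;           // succeeds only if a is still a root
            if (parent[a].compare_exchange_strong(expected, b)) return;
        }
    }
};
```

Because a root is only ever pointed at a smaller id, the parent pointers cannot form a cycle, and a failed CAS costs one retry rather than blocking other threads.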

“…Yilmazer, et al [46] propose a hardware-accelerated fine-grained lock scheme for GPUs, which adds support for queuing locks in L1 and L2 caches and uses a customized communication protocol to enable faster lock transfer and to reduce lock retries for non-coherent caches. ElTantawy, et al [13] propose a hardware warp scheduling policy that reduces lock retries by de-prioritizing warps whose threads are spin waiting. In addition, hardware accelerated locks have also been proposed for CPUs [4,25,42,47].…”

Section: GPU Solutions (mentioning)

confidence: 99%

“…To evaluate our solution, we use three state-of-the-art GPU implementations of irregular algorithms, which have been shown to compare favorably against CPU implementations [18,22,33], and we use two microbenchmarks, which have been used in previous work on fine-grained locking [12,13,46] and transactional memory [10,15,16,37,45] on GPUs. The two microbenchmarks represent commonly used lock patterns for workloads that manipulate irregular data structures, such as graphs and trees.…”

Section: Benchmarks and Inputs (mentioning)

confidence: 99%

Fast Fine-Grained Global Synchronization on GPUs

Wang

Fussell

Lin

2019

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst

This paper extends the reach of General Purpose GPU programming by presenting a software architecture that supports efficient fine-grained synchronization over global memory. The key idea is to transform global synchronization into global communication so that conflicts are serialized at the thread block level. With this structure, the threads within each thread block can synchronize using low latency, high-bandwidth local scratchpad memory. To enable this architecture, we implement a scalable and efficient message passing library. Using Nvidia GTX 1080 Ti GPUs, we evaluate our new software architecture by using it to solve a set of five irregular problems on a variety of workloads. We find that on average, our solutions improve performance over carefully tuned state-of-the-art solutions by 3.6×.

CCS Concepts • Computer systems organization → Single instruction, multiple data; • Software and its engineering → Mutual exclusion; Message passing.

“…The restart has a similar effect to backoff locking [36], where a spinlocking thread does meaningless work to temporarily relieve contention over the atomic unit; this is useful when DRAM operations are not slow and atomic operations are fast so that the backoff window is small. ElTantawy and Aamodt [10] showed that an adaptive backoff improves the performance even further, since small backoff delay may increase spinning overheads while a large backoff delay may throttle warps more than necessary. From our experiments we find that spinlocks on high-contention nodes (specifically, full and leaf nodes during insertions) reduce the amount of resident warps that can make progress.…”

Section: Restarts Instead Of Spinlocks (mentioning)

confidence: 99%

Engineering a high-performance GPU B-Tree

Awad

Ashkiani

Johnson

et al.

2019

Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming

We engineer a GPU implementation of a B-Tree that supports concurrent queries (point, range, and successor) and updates (insertions and deletions). Our B-tree outperforms the state of the art, a GPU log-structured merge tree (LSM) and a GPU sorted array. In particular, point and range queries are significantly faster than in a GPU LSM (the GPU LSM does not implement successor queries). Furthermore, B-Tree insertions are also faster than LSM and sorted array insertions unless insertions come in batches of more than roughly 100k. Because we cache the upper levels of the tree, we achieve lookup throughput that exceeds the DRAM bandwidth of the GPU. We demonstrate that the key limiter of performance on a GPU is contention and describe the design choices that allow us to achieve this high performance.
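The backoff locking mentioned in the citation statement for this entry can be sketched as a spinlock that waits longer after each failed acquisition. The example below is a CPU-side C++ analogue with a capped exponential window. The class name, the cap of 1024, and the use of `std::this_thread::yield` as the "meaningless work" are assumptions for illustration; it is not the adaptive hardware scheme of ElTantawy and Aamodt, nor the restart mechanism of the B-Tree paper.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>

// Hypothetical illustration: a test-and-set spinlock with capped
// exponential backoff. Names and constants are assumptions.
class BackoffSpinLock {
    std::atomic<bool> locked{false};

public:
    void lock() {
        unsigned delay = 1;               // current backoff window, in yields
        const unsigned max_delay = 1024;  // assumed cap, not from the source
        // exchange() returns the previous value: true means someone else
        // holds the lock, so back off before touching the atomic again.
        while (locked.exchange(true, std::memory_order_acquire)) {
            for (unsigned i = 0; i < delay; ++i) std::this_thread::yield();
            delay = std::min(delay * 2, max_delay);
        }
    }

    void unlock() { locked.store(false, std::memory_order_release); }
};
```

The trade-off named in the quote shows up in `max_delay`: a small window keeps waiters hammering the atomic, while a large one can leave the lock idle after it is released.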
