Software Transactional Memory for GPU Architectures

Xu, Yunlong; Wang, Rui; Goswami, Nilanjan; Li, Tao; Lan, Guanghui; Qian, Depei

doi:10.1145/2544137.2544139

Cited by 21 publications

(21 citation statements)

References 20 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In addition, previous work attempts to improve the performance and programmability of GPUs by supporting transactional memory [10,11,15,16,37,45] and by providing memory consistency and memory coherence on GPUs [5,19,36,[38][39][40].…”

Section: Gpu Solutionsmentioning

confidence: 99%

“…To evaluate our solution, we use three state-of-art GPU implementations of irregular algorithms, which have been shown to compare favorably against CPU implementations [18,22,33], and we use two microbenchmarks. which have been used in previous work on fine-grained locking [12,13,46] and transactional memory [10,15,16,37,45] on GPUs. The two microbenchmarks represent commonly used lock patterns for workloads that manipulate irregular data structures, such as graphs and trees.…”

Section: Benchmarks and Inputsmentioning

confidence: 99%

See 1 more Smart Citation

Fast Fine-Grained Global Synchronization on GPUs

Wang

Fussell

Lin

2019

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst

View full text Add to dashboard Cite

This paper extends the reach of General Purpose GPU programming by presenting a software architecture that supports efficient fine-grained synchronization over global memory. The key idea is to transform global synchronization into global communication so that conflicts are serialized at the thread block level. With this structure, the threads within each thread block can synchronize using low latency, high-bandwidth local scratchpad memory. To enable this architecture, we implement a scalable and efficient message passing library. Using Nvidia GTX 1080 ti GPUs, we evaluate our new software architecture by using it to solve a set of five irregular problems on a variety of workloads. We find that on average, our solutions improve performance over carefully tuned state-of-the-art solutions by 3.6×. CCS Concepts • Computer systems organization → Single instruction, multiple data; • Software and its engineering → Mutual exclusion; Message passing.

show abstract

Section: Gpu Solutionsmentioning

confidence: 99%

Section: Benchmarks and Inputsmentioning

confidence: 99%

Fast Fine-Grained Global Synchronization on GPUs

Wang

Fussell

Lin

2019

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst

View full text Add to dashboard Cite

show abstract

“…Yunlong Xu et.al. have developed a STM based technique for GPU based systems in [5]. The authors claim that their technique is free from livelocks and is scalable.…”

Section: Related Workmentioning

confidence: 99%

Pulsating STM – The in-memory Optimistic Concurrency Control Technique for Multi Core Systems

Jafar*,

Rajnish,

Kumar

2019

IJEAT

View full text Add to dashboard Cite

In the world of ever increasing parallelism, the problem of deadlock-free concurrency control is inevitable. As the number of processing cores is increasing, the number of processing threads is also increasing, and with this increase in the number of processing threads, there is a good chance of problems arising due to lack of proper concurrency control. The application areas under the domain of advanced graphics, cryptography, deep learning, embedded system programming, artificial intelligence and networking are prone to the problems of heavy uncontrolled concurrency of threads. This paper presents a novel Software Transactional Memory (STM) based optimistic concurrency control technique that is deadlock free for threads accessing the in-memory data structure for the purpose of reading as well as writing. The technique is lock free and is based upon timestamping. Threads involved in the proposed approach possess the transactional properties of atomicity, concurrency and isolation. Durability is not expected as the threads are working on an in-memory data source. The approach involves lazy conflict detection that ensures minimum aborts and restarts as well as maximum concurrency among transactions. Being lock free, the algorithm is better than the existing lock-based techniques. The technique is tested on Sniper-6.1 multi core simulator simulating 64 CPU cores and running 16, 32, 40 and 50 threads in our case. The results show significant improvement in throughput with the increasing number of threads over the existing lock-based techniques as well as other STM techniques based on optimistic concurrency control.

show abstract

“…Now, using spinlock to realize locking on GPUs may lead to a deadlock, consider the locking scheme of Table .…”

Section: Deadlocksmentioning

confidence: 99%

A deadlock‐free lock‐based synchronization for GPUs

Anand

Srivastava

Shyamasundar

2018

Concurrency and Computation

View full text Add to dashboard Cite

Summary Graphics Processing Units (GPUs) have evolved from pure graphics applications toward general purpose applications, often referred to as GPGPU computing. However, its scope is still limited to data‐parallel applications that require little synchronization. As synchronization on GPUs is quite costly, synchronization requirements in GPUs are usually realized using existing synchronization primitives like atomic operations and barriers. These approaches either incur significant overhead or place certain restrictions in their usage, affecting the scalability/scope of such applications. The lack of adequate support for fine‐grained synchronization has restricted the realization of irregular algorithms on GPUs, wherein control flow and memory access patterns are data‐dependent and unpredictable. Recently, there has been an interest in building relationship between lock‐step semantics and interleaving semantics and to develop lock‐based synchronization mechanism for GPUs to overcome these issues. GPUs follow SIMD, and hence, when adapted for general purpose computing, new distinct deadlock scenarios arise. In this paper, we discuss various deadlock scenarios that can happen in GPUs, and present a modeling of deadlocks in GPUs. We shall first illustrate such deadlock scenarios in GPU applications, and then describe a novel lock‐based deadlock‐free, fine‐grained synchronization mechanism for GPU architectures that overcomes deadlocks without a significant overhead. We further establish the correctness of our methods and discuss the performance overheads.

show abstract

Software Transactional Memory for GPU Architectures

Cited by 21 publications

References 20 publications

Fast Fine-Grained Global Synchronization on GPUs

Fast Fine-Grained Global Synchronization on GPUs

Pulsating STM – The in-memory Optimistic Concurrency Control Technique for Multi Core Systems

A deadlock‐free lock‐based synchronization for GPUs

Contact Info

Product

Resources

About