Understanding the Performance of Concurrent Data Structures on Graphics Processors

Cederman, Daniel; Chatterjee, Bapi; Tsigas, Philippas

doi:10.1007/978-3-642-32820-6_87

Cited by 16 publications

(12 citation statements)

References 13 publications


“…The single thread based operations tend to incur more contentions in CAS operations. As reported by Cederman et al [5], the GPU based queue operations are slower than their multi-core equivalents.…”

Section: Introduction (mentioning)

confidence: 79%
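The CAS contention that the statement above refers to can be illustrated with a small CPU-side sketch. This is not code from Cederman et al. or from the citing paper; it is a minimal C++ analogue, assuming a single shared counter updated through a compare-and-swap retry loop. The names `CasStats` and `run_cas_increments` are hypothetical and introduced only for this illustration. Each failed CAS means the thread's attempt was wasted and had to be retried (on some platforms `compare_exchange_weak` can also fail spuriously, and those failures are counted as well).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type for this sketch.
struct CasStats {
    std::uint64_t total;    // final value of the shared counter
    std::uint64_t retries;  // failed CAS attempts summed over all threads
};

// Hypothetical helper: every thread increments one shared counter `iters`
// times with a CAS retry loop, and failed attempts are counted.
inline CasStats run_cas_increments(unsigned threads, unsigned iters) {
    assert(threads > 0);
    std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> retries{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, &retries, iters] {
            std::uint64_t local_retries = 0;
            for (unsigned i = 0; i < iters; ++i) {
                std::uint64_t seen = counter.load(std::memory_order_relaxed);
                // On failure, compare_exchange_weak reloads `seen` with the
                // current value, so the loop simply tries again.
                while (!counter.compare_exchange_weak(
                           seen, seen + 1, std::memory_order_relaxed)) {
                    ++local_retries;
                }
            }
            retries.fetch_add(local_retries, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    return {counter.load(), retries.load()};
}
```

Calling `run_cas_increments(4, 1000)` always yields a `total` of 4000, while `retries` varies from run to run and generally grows with the number of threads competing for the same location.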

“…Many GPU based libraries supporting various programming primitives have been proposed (e.g., [3], [4]). On the other hand, the FIFO queue, which is one of the most fundamental data structures and has wide applications, has only attracted limited research efforts (e.g., [5], [6]). In this paper, we propose an efficient concurrent lock-free queue for GPGPU.…”

Section: Introduction (mentioning)

confidence: 99%

“…Misra and Chaudhuri [6] demonstrated the usage of the GPU's CAS operator to implement various concurrent data structures. Cederman et al [5] implemented the concurrent lock-free queue proposed by Michael and Scott on GPUs. Both of these works only enable a lower level of performance than what can be achieved on multi-core platforms.…”

Section: Introduction (mentioning)

confidence: 99%


Toward Concurrent Lock-Free Queues on GPUs

Zhang

Deng

2014

IEICE Trans. Inf. & Syst.


SUMMARY: General purpose computing on GPU (GPGPU) has become a popular computing model for high-performance, data-intensive applications. Accordingly, there is a strong need to develop highly efficient data structures to ease the development of GPGPU applications. In this work, we proposed an efficient concurrent queue data structure for GPU computing. The GPU based provably correct, lock-free FIFO queue allows a massive number of concurrent producers and consumers. Warp-centric en-queue and de-queue procedures are introduced to better match the underlying Single-Instruction, Multiple-Thread execution model of modern GPUs. It outperforms the best previous GPU queues by up to 40 fold. The correctness of the proposed queue operations is formally validated by linearizability criteria.




“…Finally, several of these queues have been evaluated on CUDA GPUs by Cederman et al [3]. Out of a number of lock-based and two lock-free designs (i.e., MS-queue and TZqueue), they conclude that for higher concurrency, the two lock-free queue designs are nearly always highest performing.…”

Section: Related Work (mentioning)

confidence: 99%

“…By using a single thread to perform all operations on the queue at any given time, the maximum throughput at any level of contention is the same. At all numbers of threads we model the combining queue using the atomic latency for one thread in terms of Equation 3. This form is effectively the same as one would use for a serial queue, except that an additional read is performed to determine the operation to perform.…”

Section: Queue Throughput Modeling (mentioning)

confidence: 99%

Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

Scogland

Feng

2015

Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering


As core counts increase and as heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly-efficient concurrent algorithms and data structures can become bottlenecks, unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput, linearizable, blocking, concurrent FIFO queue for many-core architectures that avoids the bottlenecks and pitfalls common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to scale to as much as three orders of magnitude (1000×) faster than lock-free and combining queues on GPU platforms and two times (2×) faster on CPU devices. These results deliver critical insights into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can increase throughput.


Information Communication Technologies

Ict¹

2020

Encyclopedia of Education and Information Technologies

