Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

Scogland, Thomas R. W.; Feng, Wu-chun

doi:10.1145/2668930.2688048

Cited by 17 publications

(4 citation statements)

References 14 publications


“…The major reason is that the Maxwell architecture dramatically improves its micro-architectures for faster atomic operations, which are extensively utilized in our approach. Actually, Scogland and Feng [25] also confirmed that atomic operations have been continuously improved in the last generations of modern GPUs. Moreover, although the AMD Fury X GPU has higher bandwidth than the NVIDIA Titan X, it is in general slower for our synchronization-free SpTRSV algorithm.…”

Section: SpTRSV Performance (mentioning)

confidence: 84%

A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves

Liu, Hogg, et al. 2016

Lecture Notes in Computer Science


The sparse triangular solve kernel, SpTRSV, is an important building block for a number of numerical linear algebra routines. Parallelizing SpTRSV on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level-sets or colour-sets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overhead between the sets. To address this, we propose in this paper a novel approach for SpTRSV in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. A comparison with the state-of-the-art library supplied by the GPU vendor, using 11 sparse matrices on the latest GPU device, shows that our approach obtains an average speedup of 2.3 times in single precision and 2.14 times in double precision. The maximum speedups are 5.95 and 3.65, respectively. In addition, our method is an order of magnitude faster for the preprocessing stage than existing methods.


“…The major reason is that the Pascal architecture is equipped with higher bandwidth and improved micro-architectures for atomic operations, which are extensively utilized in our approach. Actually, Scogland and Feng [39] also confirmed that atomic operations have been continuously improved in the latest generations of modern GPUs. Moreover, although the AMD Fury X GPU has slightly higher bandwidth than the NVIDIA Titan X, it is in general slower for our synchronization-free SpTRSV algorithm.…”

Section: SpTRSV Performance (mentioning)

confidence: 85%

Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides

Liu, Hogg, et al. 2017

Concurrency and Computation


The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level-sets or colour-sets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overheads between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit the data-parallelism, we also develop an adaptive scheme for efficiently processing multiple right-hand sides in SpTRSM. A comparison with a state-of-the-art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach obtains an average speedup of over two for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster for the preprocessing stage than existing SpTRSV and SpTRSM methods.


“…An extensive body of work has embarked on the redesign of data structures for construction and general computation on the GPU [88]. Within the context of searching, these acceleration structures include sorted arrays [3], [4], [8], [51], [66], [67], [98] and linked lists [116], hash tables (see section III), spatial-partitioning trees (e.g., k-d trees [57], [115], [120], octrees [57], [119], bounding volume hierarchies (BVH) [57], [64], R-trees [71], and binary indexing trees [59], [99]), spatial-partitioning grids (e.g., uniform [36], [53], [62] and two-level [52]), skiplists [81], and queues (e.g., binary heap priority [43] and FIFO [17], [101]). Due to significant architectural differences between the CPU and GPU, search structures cannot simply be "ported" from the CPU to the GPU and maintain optimal performance.…”

Section: GPU Searching (mentioning)

confidence: 99%

Data-Parallel Hashing Techniques for GPU Architectures

Lessley, Childs 2020

IEEE Trans. Parallel Distrib. Syst.


Hash tables are one of the most fundamental data structures for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. This study surveys the state-of-the-art research on data-parallel hashing techniques for emerging massively-parallel, many-core GPU architectures. Key factors affecting the performance of different hashing schemes are discovered and used to suggest best practices and pinpoint areas for further research.
