Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2013
DOI: 10.1145/2442516.2442539
StreamScan

Abstract: Scan (also known as prefix sum) is a useful primitive for many important parallel algorithms, such as sort, BFS, SpMV, and compaction. The current state of the art in GPU-based scan implementations consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N global memory accesses, where N is the problem size. In this paper we propose StreamScan, a novel approach to implementing scan on GPUs with only one computation phase. The main idea is to restrict sync…
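The abstract is truncated here, but the single-phase idea it describes can be read as a chained scan: each thread block scans its own tile locally and then waits only for its immediate predecessor's running total, instead of taking part in global Reduce-Scan-Scan barriers. The CUDA sketch below illustrates that reading; it is not the authors' implementation, TILE, block_sums, and block_flags are assumed names, and it relies on thread blocks becoming resident roughly in blockIdx order (the deadlock concern that the dynamic work-group ID allocation quoted further down is meant to address).

```cuda
// Illustrative single-phase ("chained") scan sketch: one tile per block,
// adjacent-block synchronization only. Not the StreamScan source code.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // one element per thread, for simplicity

__global__ void chained_scan(const int *in, int *out,
                             volatile int *block_sums,   // inclusive running total per block
                             volatile int *block_flags,  // 1 once block_sums[b] is published
                             int n)
{
    __shared__ int tile[TILE];
    __shared__ int prev_prefix;

    int bid = blockIdx.x;
    int gid = bid * TILE + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0;
    __syncthreads();

    // Local inclusive scan of the tile in shared memory (Hillis-Steele).
    for (int stride = 1; stride < TILE; stride <<= 1) {
        int v = (threadIdx.x >= stride) ? tile[threadIdx.x - stride] : 0;
        __syncthreads();
        tile[threadIdx.x] += v;
        __syncthreads();
    }

    // One thread waits for the previous block's running total, then publishes its own.
    if (threadIdx.x == 0) {
        int p = 0;
        if (bid > 0) {
            while (block_flags[bid - 1] == 0) { /* spin on the predecessor */ }
            p = block_sums[bid - 1];
        }
        prev_prefix = p;
        block_sums[bid] = p + tile[TILE - 1];
        __threadfence();            // make the sum visible before raising the flag
        block_flags[bid] = 1;
    }
    __syncthreads();

    if (gid < n) out[gid] = prev_prefix + tile[threadIdx.x];
}

int main() {
    const int n = 1 << 20, blocks = (n + TILE - 1) / TILE;
    int *in, *out, *sums, *flags;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    cudaMallocManaged(&sums, blocks * sizeof(int));
    cudaMallocManaged(&flags, blocks * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;
    cudaMemset(flags, 0, blocks * sizeof(int));

    chained_scan<<<blocks, TILE>>>(in, out, sums, flags, n);
    cudaDeviceSynchronize();
    printf("out[n-1] = %d (expected %d)\n", out[n - 1], n);
    return 0;
}
```

In this scheme each input element is read once and each output written once, which is where the roughly 2N (rather than 3N) global memory traffic of a single-phase scan comes from.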

Cited by 56 publications (6 citation statements)
References 21 publications
“…Some researchers have also utilized atomic operations for improving fundamental algorithms such as bitonic sort [29], prefix-sum scan [30], wavefront [11], sparse transposition [27], and sparse matrix-vector multiplication [14,16,17]. Unlike those problems, the SpTRSV operation is inherently serial and thus more irregular and complex.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Given a list of allocation requirements for each thread, prefix sum computes the offsets for where each thread should start writing its output elements. Fortunately, efficient GPU prefix sums have been proposed, and the CUB library has already provided standard routines for CUDA users to invoke. Thus, we need only 1 atomic operation for each block.…”
Section: Design (citation type: mentioning)
confidence: 99%
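The passage above describes the standard compaction pattern: scan per-producer allocation counts to obtain write offsets, with CUB supplying the device-wide scan. Below is a small illustration of that pattern using cub::DeviceScan::ExclusiveSum and its usual two-call temporary-storage idiom; the array names and count values are made up for the example.

```cuda
// Compute write offsets from per-producer output counts with CUB's exclusive scan.
#include <cstdio>
#include <cub/cub.cuh>

int main() {
    const int n = 8;
    int h_counts[n] = {3, 0, 2, 5, 1, 0, 4, 2};   // elements each producer will emit
    int *d_counts, *d_offsets;
    cudaMalloc(&d_counts, n * sizeof(int));
    cudaMalloc(&d_offsets, n * sizeof(int));
    cudaMemcpy(d_counts, h_counts, n * sizeof(int), cudaMemcpyHostToDevice);

    // CUB's two-call idiom: the first call only reports the temporary storage size.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_offsets, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_offsets, n);

    int h_offsets[n];
    cudaMemcpy(h_offsets, d_offsets, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("producer %d writes %d element(s) starting at offset %d\n",
               i, h_counts[i], h_offsets[i]);

    cudaFree(d_temp); cudaFree(d_counts); cudaFree(d_offsets);
    return 0;
}
```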
“…The second barrier() entails a global memory fence, which ensures correct ordering of global memory operations. Similar synchronization procedures between adjacent work-groups are explained in [19] [20]. In order to avoid potential deadlocks due to the non-deterministic scheduling of work-groups, we deploy a dynamic work-group ID allocation [19].…”
Section: Fast Padding and Unpadding Kernels (citation type: mentioning)
confidence: 99%
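The dynamic work-group ID allocation mentioned in the quote can be sketched, in CUDA terms, as drawing each block's logical ID from a global atomic counter so that IDs follow the actual scheduling order; spinning on the predecessor's flag is then safe because that predecessor is guaranteed to be running already. This is an assumed illustration of the technique cited as [19], not code from either paper; next_block_id and flags are placeholder names, and __threadfence() stands in for the global memory fence that the second barrier() provides in OpenCL.

```cuda
#include <cuda_runtime.h>

__device__ int next_block_id;   // zero-initialized device global; reset before each launch

__global__ void chained_kernel(volatile int *flags)
{
    __shared__ int bid;
    if (threadIdx.x == 0)
        bid = atomicAdd(&next_block_id, 1);   // logical ID in actual scheduling order
    __syncthreads();

    // ... per-block work whose results the next block will consume goes here ...

    if (threadIdx.x == 0) {
        if (bid > 0)
            while (flags[bid - 1] == 0) { /* safe: predecessor was scheduled first */ }
        __threadfence();     // order this block's global writes before the signal
        flags[bid] = 1;      // hand off to the successor
    }
    __syncthreads();
}
```

With blockIdx.x in place of the atomic counter, a block could spin on a predecessor that has not yet been scheduled while the hardware is fully occupied by later blocks, which is exactly the deadlock the quoted passage guards against.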