Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2013
DOI: 10.1145/2442516.2442539
StreamScan

Abstract: Scan (also known as prefix sum) is a useful primitive for many important parallel algorithms, such as sort, BFS, SpMV, and compaction. The current state of the art in GPU-based scan implementations consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N global memory accesses, where N is the problem size. In this paper we propose StreamScan, a novel approach to implementing scan on GPUs with only one computation phase. The main idea is to restrict sync…
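The abstract is truncated here, but the single-phase idea it describes can be read as a chained scan: each thread block scans its own tile locally and then waits only for its immediate predecessor's running total, instead of taking part in global Reduce-Scan-Scan barriers. The CUDA sketch below illustrates that reading; it is not the authors' implementation, TILE, block_sums, and block_flags are assumed names, and it relies on thread blocks becoming resident roughly in blockIdx order (the deadlock concern that the dynamic work-group ID allocation quoted further down is meant to address).

```cuda
// Illustrative single-phase ("chained") scan sketch: one tile per block,
// adjacent-block synchronization only. Not the StreamScan source code.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // one element per thread, for simplicity

__global__ void chained_scan(const int *in, int *out,
                             volatile int *block_sums,   // inclusive running total per block
                             volatile int *block_flags,  // 1 once block_sums[b] is published
                             int n)
{
    __shared__ int tile[TILE];
    __shared__ int prev_prefix;

    int bid = blockIdx.x;
    int gid = bid * TILE + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0;
    __syncthreads();

    // Local inclusive scan of the tile in shared memory (Hillis-Steele).
    for (int stride = 1; stride < TILE; stride <<= 1) {
        int v = (threadIdx.x >= stride) ? tile[threadIdx.x - stride] : 0;
        __syncthreads();
        tile[threadIdx.x] += v;
        __syncthreads();
    }

    // One thread waits for the previous block's running total, then publishes its own.
    if (threadIdx.x == 0) {
        int p = 0;
        if (bid > 0) {
            while (block_flags[bid - 1] == 0) { /* spin on the predecessor */ }
            p = block_sums[bid - 1];
        }
        prev_prefix = p;
        block_sums[bid] = p + tile[TILE - 1];
        __threadfence();            // make the sum visible before raising the flag
        block_flags[bid] = 1;
    }
    __syncthreads();

    if (gid < n) out[gid] = prev_prefix + tile[threadIdx.x];
}

int main() {
    const int n = 1 << 20, blocks = (n + TILE - 1) / TILE;
    int *in, *out, *sums, *flags;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    cudaMallocManaged(&sums, blocks * sizeof(int));
    cudaMallocManaged(&flags, blocks * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;
    cudaMemset(flags, 0, blocks * sizeof(int));

    chained_scan<<<blocks, TILE>>>(in, out, sums, flags, n);
    cudaDeviceSynchronize();
    printf("out[n-1] = %d (expected %d)\n", out[n - 1], n);
    return 0;
}
```

In this scheme each input element is read once and each output written once, which is where the roughly 2N (rather than 3N) global memory traffic of a single-phase scan comes from.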

Cited by 56 publications (6 citation statements)
References 21 publications
“…Some researchers have also utilized atomic operations for improving fundamental algorithms such as bitonic sort [29], prefix-sum scan [30], wavefront [11], sparse transposition [27], and sparse matrix-vector multiplication [14,16,17]. Unlike those problems, the SpTRSV operation is inherently serial and thus more irregular and complex.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Given a list of allocation requirements for each thread, prefix sum computes the offsets for where each thread should start writing its output elements. Fortunately, efficient GPU prefix sums have been proposed, and the CUB library has already provided standard routines for CUDA users to invoke. Thus, we need only 1 atomic operation for each block.…”
Section: Design (citation type: mentioning)
confidence: 99%
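The passage above describes the standard compaction pattern: scan per-producer allocation counts to obtain write offsets, with CUB supplying the device-wide scan. Below is a small illustration of that pattern using cub::DeviceScan::ExclusiveSum and its usual two-call temporary-storage idiom; the array names and count values are made up for the example.

```cuda
// Compute write offsets from per-producer output counts with CUB's exclusive scan.
#include <cstdio>
#include <cub/cub.cuh>

int main() {
    const int n = 8;
    int h_counts[n] = {3, 0, 2, 5, 1, 0, 4, 2};   // elements each producer will emit
    int *d_counts, *d_offsets;
    cudaMalloc(&d_counts, n * sizeof(int));
    cudaMalloc(&d_offsets, n * sizeof(int));
    cudaMemcpy(d_counts, h_counts, n * sizeof(int), cudaMemcpyHostToDevice);

    // CUB's two-call idiom: the first call only reports the temporary storage size.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_offsets, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_offsets, n);

    int h_offsets[n];
    cudaMemcpy(h_offsets, d_offsets, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("producer %d writes %d element(s) starting at offset %d\n",
               i, h_counts[i], h_offsets[i]);

    cudaFree(d_temp); cudaFree(d_counts); cudaFree(d_offsets);
    return 0;
}
```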
“…The second barrier() entails a global memory fence, which ensures correct ordering of global memory operations. Similar synchronization procedures between adjacent work-groups are explained in [19] [20]. In order to avoid potential deadlocks due to the non-deterministic scheduling of work-groups, we deploy a dynamic work-group ID allocation [19].…”
Section: Fast Padding and Unpadding Kernels (citation type: mentioning)
confidence: 99%
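The dynamic work-group ID allocation mentioned in the quote can be sketched, in CUDA terms, as drawing each block's logical ID from a global atomic counter so that IDs follow the actual scheduling order; spinning on the predecessor's flag is then safe because that predecessor is guaranteed to be running already. This is an assumed illustration of the technique cited as [19], not code from either paper; next_block_id and flags are placeholder names, and __threadfence() stands in for the global memory fence that the second barrier() provides in OpenCL.

```cuda
#include <cuda_runtime.h>

__device__ int next_block_id;   // zero-initialized device global; reset before each launch

__global__ void chained_kernel(volatile int *flags)
{
    __shared__ int bid;
    if (threadIdx.x == 0)
        bid = atomicAdd(&next_block_id, 1);   // logical ID in actual scheduling order
    __syncthreads();

    // ... per-block work whose results the next block will consume goes here ...

    if (threadIdx.x == 0) {
        if (bid > 0)
            while (flags[bid - 1] == 0) { /* safe: predecessor was scheduled first */ }
        __threadfence();     // order this block's global writes before the signal
        flags[bid] = 1;      // hand off to the successor
    }
    __syncthreads();
}
```

With blockIdx.x in place of the atomic counter, a block could spin on a predecessor that has not yet been scheduled while the hardware is fully occupied by later blocks, which is exactly the deadlock the quoted passage guards against.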