2010 International Conference on High Performance Computing (HiPC)
DOI: 10.1109/hipc.2010.5713189
Approaches for parallelizing reductions on modern GPUs

Abstract: GPU hardware and software have been evolving rapidly. CUDA versions 1.1 and higher support atomic operations on device memory, and CUDA versions 1.2 and higher support atomic operations on shared memory. This paper focuses on parallelizing applications involving reductions on GPUs. Before locking support became available, such applications could only be parallelized using full replication, i.e., by creating a copy of the reduction object for each thread. However, CUDA 1.1 (1.2) o…

Cited by 6 publications (6 citation statements)
References 22 publications
“…Baskaran et al [2008] proposed a compiler framework for optimizing memory access in affine loops. Huo et al [2010]; Gutierrez et al [2008] show that several applications are improved by using scratchpad memory instead of using global memory.…”
Section: Miscellaneous
confidence: 99%
“…Michela Becchi et al [15] proposed moving computation to the data to reduce communication overheads: if a previous function generates its data on the CPU, the next function performs better using that data on the CPU rather than moving both the data and the computation from the CPU to the GPU, and vice versa. Xin Huo et al [17] proposed parallelizing reductions in which groups of threads use atomic operations to update a single copy of the reduction object. They showed that decoupling the thread-array structure from the data layout improves both programmer productivity and performance.…”
Section: Related Work
confidence: 99%
“…They showed that decoupling the thread-array structure from the data layout improves both programmer productivity and performance. Xin Huo et al [17] proposed parallelizing reductions in which groups of threads use atomic operations to update a single copy of the reduction object. Andrew et al [18] selected a set of parameters and derived an optimal model for workloads, using the Ocelot framework to convert PTX into the LLVM Intermediate Representation (IR).…”
Section: Related Work
confidence: 99%
“…In fact, many applications have benefited from the massive parallelism of GPUs [13], [14], [25], [27], [33], [36], [38]. In addition, researchers have also used GPUs to solve specific artificial intelligence (AI) problems successfully [1].…”
Section: Introduction
confidence: 99%