2010 International Conference on High Performance Computing (HiPC)
DOI: 10.1109/hipc.2010.5713189
Approaches for parallelizing reductions on modern GPUs

Abstract: GPU hardware and software have been evolving rapidly. CUDA versions 1.1 and higher support atomic operations on device memory, and CUDA versions 1.2 and higher support atomic operations on shared memory. This paper focuses on parallelizing applications involving reductions on GPUs. Before locking support became available, such applications could only be parallelized using full replication, i.e., by creating a copy of the reduction object for each thread. However, CUDA 1.1 (1.2) o…

Cited by 6 publications (6 citation statements)
References 22 publications
“…Baskaran et al [2008] proposed a compiler framework for optimizing memory access in affine loops. Huo et al [2010]; Gutierrez et al [2008] show that several applications are improved by using scratchpad memory instead of using global memory.…”
Section: Miscellaneous
confidence: 99%
“…Michela Becchi et al [15] proposed moving computation to the data to reduce communication overheads: if a previous function generates its data on the CPU, the next function performs better using that data on the CPU rather than moving both the data and the computation from the CPU to the GPU, and vice versa. Xin Huo et al [17] proposed parallelizing reductions in which groups of threads use atomic operations to update a single copy of the reduction object. They showed that decoupling the thread-array structure from the data layout improves both programmer productivity and performance.…”
Section: Related Work
confidence: 99%
“…They showed that decoupling the thread-array structure from the data layout improves both programmer productivity and performance. Xin Huo et al [17] proposed parallelizing reductions in which groups of threads use atomic operations to update a single copy of the reduction object. Andrew et al [18] selected a set of parameters and derived an optimal model for workloads, using the Ocelot framework to convert PTX into the LLVM Intermediate Representation (IR).…”
Section: Related Work
confidence: 99%
“…In fact, many applications have benefited from the massive parallelism of GPUs [13], [14], [25], [27], [33], [36], [38]. In addition, researchers have also used GPUs to solve specific artificial intelligence (AI) problems successfully [1].…”
Section: Introduction
confidence: 99%