This paper focuses on the performance gain obtained on Kepler graphics processing units (GPUs) for multi-key quicksort. Because multi-key quicksort is a recursive algorithm, many researchers have found it tedious to parallelize on multi-core and many-core architectures. A survey of state-of-the-art string sorting algorithms and a close study of the Kepler GPU architecture led to the idea of matching the recursive template of multi-key quicksort with the dynamic parallelism feature offered by Kepler-based GPUs. The CPU-parallel implementation shows a 33% to 50% improvement over 8-bit parallel most significant digit (MSD) radix sort and a 62% to 75% improvement over 16-bit parallel MSD radix sort. The GPU implementation of multi-key quicksort achieves a 6× to 18× speedup over the CPU-parallel implementation of multi-key quicksort, and a 1.5× to 3× speedup over the GPU string sorting algorithm using singleton elements reported in the literature. Copyright © 2016 John Wiley & Sons, Ltd.
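The abstract contains no code; the sketch below is our own minimal illustration, not the authors' implementation, of the core idea of mapping multi-key quicksort's recursion onto Kepler dynamic parallelism, where each recursive call becomes a child-kernel launch from device code. The kernel name and the single-thread partition are simplifying assumptions; compiling requires compute capability 3.5+ and nvcc's -rdc=true.

```cuda
// Illustrative sketch only: multi-key (three-way radix) quicksort over an
// array of device string pointers, with each recursive call expressed as a
// child-kernel launch (Kepler dynamic parallelism, sm_35+, nvcc -rdc=true).
// A single thread does the partition; a tuned version would parallelize it
// and must respect the device runtime's kernel-nesting-depth limit.
__device__ void swap_str(const char **a, const char **b) {
    const char *t = *a; *a = *b; *b = t;
}

__global__ void mkq_kernel(const char **s, int lo, int hi, int depth) {
    if (hi - lo < 2) return;                 // nothing to sort
    char pivot = s[lo][depth];
    int lt = lo, gt = hi, i = lo;
    while (i < gt) {                         // Dijkstra three-way partition
        char c = s[i][depth];
        if      (c < pivot) swap_str(&s[lt++], &s[i++]);
        else if (c > pivot) swap_str(&s[i], &s[--gt]);
        else                i++;
    }
    // Recurse on the <, =, > buckets; the "=" bucket advances one character.
    // Launches go to the null stream here (serialized); separate non-blocking
    // device streams would let the three buckets sort concurrently.
    mkq_kernel<<<1, 1>>>(s, lo, lt, depth);
    if (pivot != '\0')
        mkq_kernel<<<1, 1>>>(s, lt, gt, depth + 1);
    mkq_kernel<<<1, 1>>>(s, gt, hi, depth);
}
```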
Many-threaded graphics processing units (GPUs) are well suited to general purpose computations that demand high performance. The processor hides memory access latency by scheduling warps (groups of 32 threads) so that while one warp computes, other warps perform memory accesses. But for memory-access-bound irregular applications such as sparse matrix vector multiplication (SpMV), memory access times are high, and improving the performance of such applications on the GPU is a challenging research issue. Further, optimizing SpMV time on the GPU is important for iterative applications such as Jacobi and conjugate gradient solvers. However, the overheads incurred while computing SpMV on the GPU must also be considered: transforming the input matrix to a desired format and communicating the data from CPU to GPU are non-trivial overheads associated with SpMV computation on the GPU. If the chosen format does not suit the given input sparse matrix, the desired performance improvements cannot be achieved. Motivated by this observation, this paper proposes a method to choose an optimal sparse matrix format, focusing on applications where CPU-to-GPU communication time and preprocessing time are non-trivial. The experimental results show that the format predicted by the model matches the actual best-performing format when the total SpMV time, comprising preprocessing time, CPU-to-GPU communication time, and SpMV computation time on the GPU, is taken into account. The model predicts an optimal format for any given input sparse matrix with a very small prediction overhead within an application. Compared with choosing a format for high performance on the GPU alone, this approach is more comprehensive and valuable. The paper also proposes a sparse matrix format that optimizes communication and preprocessing overheads, to be used when these overheads are non-trivial.
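As a concrete, hypothetical illustration of such a predictor, the host-side sketch below chooses a format from cheap static statistics of the CSR row-pointer array. The thresholds and names are our own placeholder assumptions, not the paper's model.

```cuda
// Hypothetical rule-based format chooser (host code): reads only static,
// O(rows) statistics from the input, so the prediction overhead stays tiny
// next to format-conversion and CPU-to-GPU transfer costs. Thresholds are
// illustrative assumptions, not values from the paper.
#include <vector>
#include <algorithm>

enum class SpmvFormat { CSR, ELL, HYB };

SpmvFormat chooseFormat(const std::vector<int>& rowPtr) {
    int rows = (int)rowPtr.size() - 1;
    int nnz  = rowPtr[rows];
    int maxRow = 0;
    for (int r = 0; r < rows; ++r)
        maxRow = std::max(maxRow, rowPtr[r + 1] - rowPtr[r]);
    double avgRow = (double)nnz / rows;
    if (maxRow <= 2.0 * avgRow)  // near-uniform rows: ELL pads cheaply and
        return SpmvFormat::ELL;  // gives coalesced column-major access
    if (maxRow > 16.0 * avgRow)  // a few very long rows: split them off
        return SpmvFormat::HYB;  // into a COO tail (hybrid format)
    return SpmvFormat::CSR;      // default: zero conversion cost when the
}                                // input already arrives in CSR
```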
General purpose computation on the graphics processing unit (GPU) is rapidly entering various scientific and engineering fields, and many applications are being ported onto GPUs for better performance. Various optimizations, frameworks, and tools are being developed for effective GPU programming. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called kernel coalescing that enhances GPU performance and also optimizes CPU-to-GPU communication time. With the kernel coalescing methods proposed in this paper, kernel launch overheads are reduced by coalescing concurrent kernels, and data transfers are reduced in the case of intermediate data generated and used among kernels. Computation on the device (GPU) is optimized by tuning the number of blocks and threads launched to the architecture. The block-level kernel coalescing method yields a prominent performance improvement on devices without support for concurrent kernels. The thread-level method is better than the block-level method when the grid structure (i.e., the number of blocks and threads) is not optimal for the device architecture and leads to underutilization of device resources. Both methods perform similarly when the number of threads per block is approximately the same across kernels and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. The multi-clock-cycle coalescing method can be chosen to coalesce more than two concurrent kernels that together or individually exceed the thread capacity of the device; if the kernels have lightweight thread computations, it gives better performance than the thread-level and block-level methods. If the kernels to be coalesced combine compute-intensive and memory-intensive kernels, warp interleaving gives higher device occupancy and improves performance. For micro-benchmark 1 considered in this paper, the multi-clock-cycle method resulted in 10-40% and 80-92% improvement over separate kernel launches, without and with shared input and intermediate data among the kernels, respectively, on a Fermi architecture device (GTX 470). A nearest neighbor (NN) kernel from the Rodinia benchmark, coalesced with itself using the thread-level method and warp interleaving, gives 131.9% and 152.3% improvement over separate kernel launches and 39.5% and 36.8% improvement over the block-level method, respectively.
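To make the thread-level idea concrete, here is a minimal sketch of our own (not the paper's benchmark code) showing two lightweight element-wise kernels and their coalesced equivalent; fusing them saves one kernel launch and keeps the intermediate value in a register instead of a global-memory round trip.

```cuda
// Two independent lightweight kernels with a producer/consumer relationship.
__global__ void kernelA(const float *x, float *tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = 2.0f * x[i];          // writes intermediate data
}
__global__ void kernelB(const float *tmp, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = tmp[i] + 1.0f;          // reads intermediate data
}

// Thread-level coalesced version: a single launch, a single launch overhead,
// and the intermediate value never leaves a register.
__global__ void kernelAB(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = 2.0f * x[i];                // former kernelA body
        y[i] = t + 1.0f;                      // former kernelB body
    }
}
```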
General purpose computation on the graphics processing unit (GPU) is prominent in the current high performance computing era. Porting data parallel applications onto the GPU gives a default performance improvement because of the increased number of computational units, and better performance can be achieved through application-specific fine tuning for the architecture under consideration. One widely used compute-intensive kernel in sparse matrix based applications is sparse matrix vector multiplication (SpMV). Most existing sparse matrix data formats were developed with the central processing unit (CPU) or multi-core processors in mind. This paper presents a new sparse matrix representation designed for the graphics processor architecture that, for the class of applications that fit the proposed format, gives 2x to 5x performance improvement over the CSR (compressed sparse row) format, 2x to 54x over the COO (coordinate) format, and 3x to 10x over the CSR vector kernel. It also gives 10% to 133% improvement in the CPU-to-GPU transfer of the sparse matrix's access information alone. The paper details the new format and the need for it, with complete experimental details and comparative results.
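For reference, the CSR baseline those speedups are measured against is the standard "scalar" kernel sketched below; this is textbook code, not the paper's proposed format, and its gathered accesses to the input vector are exactly the irregularity that GPU-oriented formats try to regularize.

```cuda
// Standard CSR "scalar" SpMV baseline: one thread per row. The indirect
// gather x[colIdx[j]] is uncoalesced for many sparsity patterns, which is
// what alternative GPU-oriented formats aim to avoid.
__global__ void spmv_csr_scalar(int rows,
                                const int *rowPtr, const int *colIdx,
                                const float *vals, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += vals[j] * x[colIdx[j]];
        y[row] = sum;
    }
}
```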
The hardware and software evolution of graphics processing units (GPUs) for general purpose computation has changed the way parallel programming issues are addressed. Many applications are being ported onto the GPU for performance gains. GPU execution time is continuously optimized by GPU programmers, and optimizing pre-GPU computation overheads has attracted the research community in the recent past. While the GPU executes the programs handed to it by the CPU, pre-GPU computation overheads do exist and should be optimized for better use of GPUs. The GPU framework proposed in this paper improves the overall performance of an application by optimizing pre-GPU computation overheads along with GPU execution time. The paper proposes a sparse matrix format prediction tool that predicts an optimal sparse matrix format for a given input matrix by analyzing the input sparse matrix and accounting for pre-GPU computation overheads. The format predicted by the proposed method is compared against the best-performing sparse matrix formats reported in the literature. The proposed model is based on static data available directly from the input, so the prediction overhead is very small. Compared with GPU-only sparse format prediction, the proposed model is more inclusive and more precise in terms of increasing the overall application's performance.
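The kind of static data such a predictor can use is illustrated below in a sketch of our own (hypothetical names): a single O(rows) pass over the CSR row-pointer array yields features such as the mean and standard deviation of nonzeros per row, which keeps the prediction overhead negligible compared with format conversion or data transfer.

```cuda
// Hypothetical one-pass feature extraction over a CSR row-pointer array.
// Everything is static input data, so no trial SpMV runs are needed and the
// prediction cost is O(rows) on the host.
#include <vector>
#include <cmath>

struct MatrixFeatures {
    int rows, nnz;
    double meanNnzPerRow, stddevNnzPerRow;
};

MatrixFeatures extractFeatures(const std::vector<int>& rowPtr) {
    MatrixFeatures f{};
    f.rows = (int)rowPtr.size() - 1;
    f.nnz  = rowPtr[f.rows];
    f.meanNnzPerRow = (double)f.nnz / f.rows;
    double var = 0.0;
    for (int r = 0; r < f.rows; ++r) {
        double d = (rowPtr[r + 1] - rowPtr[r]) - f.meanNnzPerRow;
        var += d * d;                         // accumulate squared deviation
    }
    f.stddevNnzPerRow = std::sqrt(var / f.rows);
    return f;
}
```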