Kepler GPU accelerated recursive sorting using dynamic parallelism

Neelima, B.; Shamsundar, Bharath; Narayan, Anjjan S; Prabhu, R.; Gomes, Crystal

doi:10.1002/cpe.3865

Cited by 8 publications

(9 citation statements)

References 26 publications

“…$T_{\text{parallel-CUDA}}$ and $T_{\text{parallel-NOpenCL}}$ are defined by $T_p = T_{\text{kernel}} + T_{\text{overhead}} + T_{\text{other}}$ (5), where $T_{\text{kernel}}$ is the total of the execution times of the kernels on the GPU, $T_{\text{overhead}}$ is the total of the data transfer overhead on the CPU and the GPU, and $T_{\text{other}}$ is the total of the execution times of the data structure initialization, and so on [50]. The speedup ratio reflects the overall efficiency improvement of the parallel algorithm in the corresponding architecture compared to the CPU sequential algorithm and can be used for objective evaluation of the actual system speed.…”

Section: Table 5. Radix Sort Algorithm Execution Time Under Differe… (mentioning)

confidence: 99%
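
The timing model in the statement above can be made concrete with a minimal Python sketch. The function names and the sample timings are hypothetical, chosen only for illustration. The speedup is assumed here to be the ratio of the sequential CPU time to the total parallel time, which the quoted passage describes in words but does not spell out.

```python
def total_parallel_time(t_kernel, t_overhead, t_other):
    """Total parallel time T_p = T_kernel + T_overhead + T_other.

    t_kernel:   summed kernel execution times on the GPU
    t_overhead: summed CPU<->GPU data transfer overhead
    t_other:    remaining time (data structure initialization, etc.)
    """
    return t_kernel + t_overhead + t_other


def speedup(t_sequential, t_parallel):
    """Assumed definition: sequential CPU time divided by total parallel time."""
    if t_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_sequential / t_parallel


# Illustrative numbers only (not measurements from any paper).
t_p = total_parallel_time(t_kernel=2.0, t_overhead=1.0, t_other=0.5)
ratio = speedup(t_sequential=7.0, t_parallel=t_p)
```

Because the transfer overhead and initialization terms are included in the total, a kernel that is fast in isolation can still yield a modest overall speedup.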

“…[4] The Compute Unified Device Architecture (CUDA) uses the parallel computing engine of the NVIDIA Graphic Processing Unit (GPU) to achieve a more efficient computing solution than the CPU for solving many complex computing tasks [5]. However, there are certain problems. For example, in terms of software porting, NVIDIA GPUs and AMD GPUs are not compatible with each other, and parallel algorithms are not portable.…”

Section: Introduction (mentioning)

confidence: 99%

“…The Compute Unified Device Architecture (CUDA) uses the parallel computing engine of the NVIDIA Graphic Processing Unit (GPU) to achieve a more efficient computing solution than the CPU for solving many complex computing tasks [5]. However, there are certain problems.…”

Section: Introduction (mentioning)

confidence: 99%

“…and (3) are both prefix calculations with time complexity O(n/p + log p). In steps (4) and (5), an auxiliary array Q is used to update the array number, which takes O(n/p) time. Therefore, the total time complexity of the radix sorting parallel algorithm is O(m(n/p + log p)).…”

mentioning

confidence: 99%
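
The structure described in the statement above (prefix calculations followed by a scatter through an auxiliary array Q, repeated once per bit) can be sketched sequentially in Python. This is not the cited paper's implementation: the function names are hypothetical, keys are assumed to be non-negative integers below 2**m, and the loops run serially here, whereas a GPU version would spread each pass over p processing elements.

```python
from itertools import accumulate


def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] is the sum of flags[0..i-1]."""
    return list(accumulate(flags, initial=0))[:-1]


def binary_lsd_radix_sort(keys, m):
    """Sort non-negative integers below 2**m, one bit per pass (LSD first)."""
    a = list(keys)
    n = len(a)
    for bit in range(m):
        # Flag the keys whose current bit is 0 and those whose bit is 1.
        zeros = [1 - ((k >> bit) & 1) for k in a]
        ones = [1 - z for z in zeros]
        # Two prefix calculations give each key's rank within its group.
        zero_pos = exclusive_scan(zeros)
        one_pos = exclusive_scan(ones)
        total_zeros = sum(zeros)
        # Scatter into the auxiliary array Q: zeros first, then ones.
        # Keeping the original order inside each group makes the pass stable.
        q = [0] * n
        for i, k in enumerate(a):
            if zeros[i]:
                q[zero_pos[i]] = k
            else:
                q[total_zeros + one_pos[i]] = k
        a = q
    return a


sorted_keys = binary_lsd_radix_sort([5, 3, 7, 0, 2, 6, 1, 4], m=3)
```

Each of the m passes does a constant number of scans and one scatter, which is where the factor m in the quoted bound O(m(n/p + log p)) comes from when the scans are parallelized.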

A radix sorting parallel algorithm suitable for graphic processing unit computing

Xiao, Guo, et al. 2020

Concurrency and Computation

Radix sorting is an essential basic data processing operation in many computer fields, so accelerating it on a Graphic Processing Unit (GPU) has practical significance. Heterogeneous parallel computing technology attracts much attention and is widely applied for its computational efficiency and parallel real-time data processing capability. Taking advantage of the parallelism of the GPU in numerical computation, a parallelization design method for the Binary_Least Significant Digit (LSD) first Radix Sorting (B_LSD_RS) algorithm based on Open Computing Language (OpenCL) is proposed. The radix sorting algorithm is divided into multiple kernel tasks, and the kernels are sequenced through event information transfer. The parallel algorithm is implemented and verified on a GPU + CPU heterogeneous platform. The experimental results show that, on the NVIDIA GTX 1070 computing platform, the OpenCL-based B_LSD_RS parallel algorithm obtains speedups of 28.86 times, 11.01 times, and 2.14 times over the B_LSD_RS sequential algorithm on an AMD Ryzen5 1600X CPU, the B_LSD_RS parallel algorithm based on Open Multi-Processing (OpenMP), and the B_LSD_RS parallel algorithm based on Compute Unified Device Architecture (CUDA), respectively. It thus achieves not only high performance but also performance portability among different GPU computing platforms.

“…This is a building block of suffix sorting, used in string matching, and database index construction [13]. Parallel string sorting algorithms have been proposed on CPUs [14] and GPUs [15], however, to the best of our knowledge, no hardware accelerator for this problem has been made available yet. Indeed, handling variable-length keys in hardware is not only challenging per se but also involves key comparisons that can become expensive as keys are arbitrarily long.…”

Section: Introduction (mentioning)

confidence: 99%

How Many CPU Cores is an FPGA Worth? Lessons Learned from Accelerating String Sorting on a CPU-FPGA System

Asiatici, Maiorano, Ienne 2021

J Sign Process Syst

String sorting is a fundamental kernel of string matching and database index construction; yet, it has not been studied as extensively as fixed-length keys sorting. Because processing variable-length keys in hardware is challenging, it is no surprise that no hardware-accelerated string sorters have been proposed yet. In this paper, we present Parallel Hybrid Super Scalar String Sample Sort (pHS⁵) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade CPU. Our pHS⁵ extends pS⁵, the state-of-the-art string sorting algorithm for multi-core shared memory CPUs, by adding multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable among the dominant kernels of pS⁵ by up to 33% compared to a single Intel Xeon Broadwell core despite a clock frequency that is 17 times slower. Furthermore, we extended the job scheduling mechanism of pS⁵ to schedule the accelerable kernel not only among available CPU cores but also on our PEs, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. Overall, we accelerate the entire algorithm by up to 10% with respect to the 28-thread software baseline running on the Xeon processor and by up to 36% at lower thread counts. Finally, we generalize our results assuming pS⁵ as representative of software that is heavily optimized for modern multi-core CPUs and investigate the performance and energy advantage that an FPGA in a datacenter setting can offer to regular RTL users compared to additional CPU cores.

Toward a new approach for sorting extremely large data files in the big data era

et al. 2018
