Fast segmented sort on GPUs

Hou, Kaixi; Liu, Weifeng; Wang, Hao; Feng, Wu-chun

doi:10.1145/3079079.3079105

Cited by 52 publications

(21 citation statements)

References 45 publications

(53 reference statements)

“…Other research looked at specific cases of scan, in [58] the authors look at performing scan on tuples while minimizing global reads and facilitating latency hiding. Recently there has been some work in applying scan and reduction to optimize database queries [37,47,81].…”

Section: Related Work (mentioning)

confidence: 99%

Accelerating reduction and scan using tensor core units

Dakkak

Xiong

et al. 2019

Proceedings of the ACM International Conference on Supercomputing

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4 × 4 or 16 × 16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCU to express both reduction and scan with matrix multiplication and show the benefits in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is orders of magnitude faster (up to 100× for reduction and 3× for scan) than state-of-the-art methods for small segment sizes (common in machine learning and scientific applications). Our algorithm achieves this while decreasing the power consumption by up to 22% for reduction and 16% for scan.
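The abstract's central idea, expressing reduction and scan as matrix multiplication, can be illustrated with a minimal CPU sketch. This is not the authors' tensor-core implementation; it only shows the algebraic identity under simple assumptions: each segment is one row of a small matrix, and the helper names `matmul`, `segmented_reduce`, and `segmented_scan` are hypothetical, introduced here for illustration.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def segmented_reduce(segments):
    """Sum each row (one segment per row) by multiplying with a column of ones."""
    n = len(segments[0])
    ones = [[1] for _ in range(n)]
    return [row[0] for row in matmul(segments, ones)]


def segmented_scan(segments):
    """Inclusive prefix sum of each row via an upper-triangular matrix of ones."""
    n = len(segments[0])
    # upper[k][j] is 1 when input element k contributes to output position j.
    upper = [[1 if k <= j else 0 for j in range(n)] for k in range(n)]
    return matmul(segments, upper)


segments = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(segmented_reduce(segments))  # [10, 26]
print(segmented_scan(segments))    # [[1, 3, 6, 10], [5, 11, 18, 26]]
```

On hardware with matrix units, the same products would presumably be issued as small fixed-size tiles, which is why the approach suits small segment sizes; the tiling itself is not shown here.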

“…The binary finite field multiplication algorithm was implemented by Eli Ben-Sasson et al yielded up to 138× speedup than the popular Number Theory Library [5]. Hou et al [14] implemented a register-based sort method shows great improvements over scratchpad memory methods on NVIDIA K80-Kepler and TitanX-Pascal GPUs. A 1-D stencil method is introduced as an example to illustrate how register cache and shuffle instruction works [5].…”

Section: Related Work (mentioning)

confidence: 99%

A versatile software systolic execution model for GPU memory-bound kernels

Chen

Wahib

Takizawa

et al. 2019

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5× faster than Nvidia's NPP on V100 and P100 GPUs.

CCS Concepts: Computer systems organization → Systolic arrays; Multicore architectures.

“…Compared to stochastic gradient descent (SGD) [8,9], the ALS algorithm is not only inherently parallel, but can incorporate implicit ratings [1]. Nevertheless, the ALS algorithm involves parallel sparse matrix manipulation [10] which is challenging to achieve high performance due to imbalanced workload [11,12,13], random memory access [14,15], unpredictable amount of computations [16] and task dependency [17,18,19]. This particularly holds when parallelizing and optimizing ALS on modern multi-cores and many-cores [20].…”

Section: Introduction (mentioning)

confidence: 99%

clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization

Chen

Fang

Liu

et al. 2020

Future Generation Computer Systems

Self Cite

Alternating least squares (ALS) has been proved to be an effective solver for matrix factorization in recommender systems. To speed up factorizing performance, various parallel ALS solvers have been proposed to leverage modern multi-cores and many-cores. Existing implementations are limited in either speed or portability. In this paper, we present an efficient and portable ALS solver (clMF) for recommender systems. On one hand, we diagnose the baseline implementation and observe that it lacks of the awareness of the hierarchical thread organization on modern hardware. To achieve high performance, we apply the thread batching technique, the fine-grained tiling technique and three architecture-specific optimizations. On the other hand, we implement the ALS solver in OpenCL so that it can run on various platforms (CPUs, GPUs and MICs). Based on the architectural specifics, we select a suitable code variant for each platform to efficiently map it to the underlying hardware. The experimental results show that our implementation performs 2.8×-15.7× faster on an Intel 16-core CPU, 23.9×-87.9× faster on an NVIDIA K20C GPU and 34.6×-97.1× faster on an AMD Fury X GPU than the baseline implementation. On the K20C GPU, our implementation also outperforms cuMF over different latent features ranging from 10 to 100 with various real-world recommendation datasets.
