2017
DOI: 10.15803/ijnc.7.2_208

Efficient Algorithms for Stream Compaction on GPUs

Abstract: Stream compaction, also known as stream filtering or selection, produces a smaller output array containing the indices of only the wanted elements of the input array for further processing. Given the tremendous number of data elements to be filtered, the performance of selection is of great concern. Recently, modern Graphics Processing Units (GPUs) have been increasingly used to accelerate the execution of massive, data-parallel applications. In this paper, we designed and implemented two new algo…
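To make the operation concrete, the sketch below emits the indices of the selected elements using Thrust's copy_if with a stencil. It is an illustration only, not one of the paper's two proposed algorithms (the full text is not shown here); the predicate (keep non-zero values) and the demo data are assumptions for the example.

```cpp
// Minimal stream-compaction sketch: collect the indices of the
// "wanted" elements of an input array (here, the non-zero values).
// Illustration only -- not one of the paper's proposed algorithms.
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <vector>
#include <cstdio>

struct is_wanted {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

int main() {
    std::vector<int> h_input{3, 0, 7, 0, 0, 5, 1, 0};  // assumed demo data
    thrust::device_vector<int> input(h_input.begin(), h_input.end());
    thrust::device_vector<int> indices(input.size());

    // copy_if with a stencil: the counting iterator supplies 0..N-1,
    // the input array acts as the stencil, so only the indices of
    // wanted elements are written to the (smaller) output.
    auto end = thrust::copy_if(thrust::counting_iterator<int>(0),
                               thrust::counting_iterator<int>((int)input.size()),
                               input.begin(),   // stencil
                               indices.begin(),
                               is_wanted());
    indices.resize(end - indices.begin());

    for (size_t i = 0; i < indices.size(); ++i)
        std::printf("%d ", (int)indices[i]);    // expected: 0 2 5 6
    std::printf("\n");
    return 0;
}
```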

Cited by 10 publications (8 citation statements)
References 6 publications
“…We apply V-Quant and RV-Quant to training to minimize memory cost. During training, in order to compress the sparse large activations on GPU, we use the existing work in [1]. In order to obtain quantized networks for inference, we perform fine-tuning with V-Quant for a small number of additional epochs, e.g., 1-3 epochs after total 90 epochs of original training.…”
Section: Methods
confidence: 99%
“…Compared with the existing methods of low memory cost in training [2] [5], our proposed method reduces computation cost by avoiding re-computation during back-propagation. More importantly, our proposed method has a potential of further reduction in computation cost especially in Equation (1). It is because the activation y i is mostly in low precision in our method.…”
Section: Potential Of Further Reduction In Computation Cost
confidence: 99%
“…8). Conceptually, this is an application of stream compaction [8] and usually implemented with a prefix sum [56,13]: Given a bitmap of size M, generate an indices array of size M containing i at position i if the i-th bit is set. Otherwise, store an invalid marker.…”
Section: Number Of Assigned Blocks
confidence: 99%
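The prefix-sum formulation described in this quotation can be sketched as follows: the flag (bitmap) array is exclusively scanned to obtain each set bit's output slot, and the indices of the set bits are then scattered to those slots. This is a generic Thrust illustration of that flags / scan / scatter pattern under assumed demo data, not the cited implementation.

```cpp
// Prefix-sum based stream compaction sketch, following the quoted
// description: flags -> exclusive scan -> scatter set indices.
// Generic illustration, not the cited implementations.
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <thrust/iterator/counting_iterator.h>
#include <vector>
#include <cstdio>

int main() {
    // Bitmap of size M: 1 marks a set position, 0 an unset one (assumed data).
    std::vector<int> h_flags{1, 0, 1, 0, 0, 1, 1, 0};
    const int M = (int)h_flags.size();
    thrust::device_vector<int> flags(h_flags.begin(), h_flags.end());

    // Exclusive prefix sum over the flags gives, for every set bit,
    // its position in the compacted output.
    thrust::device_vector<int> positions(M);
    thrust::exclusive_scan(flags.begin(), flags.end(), positions.begin());

    // Number of set bits = last scanned position + last flag.
    int count = (int)positions.back() + (int)flags.back();

    // Scatter the index i of every set bit to its compacted slot.
    thrust::device_vector<int> compacted(count);
    thrust::scatter_if(thrust::counting_iterator<int>(0),
                       thrust::counting_iterator<int>(M),
                       positions.begin(),   // where each index goes
                       flags.begin(),       // only if its bit is set
                       compacted.begin());

    for (int i = 0; i < count; ++i)
        std::printf("%d ", (int)compacted[i]);   // expected: 0 2 5 6
    std::printf("\n");
    return 0;
}
```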
“…This extra storage cost can be further compressed by exploiting the non-uniform distribution of values [1,43]. Applying PWLQ on both weights and activations is discussed in the supplementary material.…”
confidence: 99%