2016 Visual Communications and Image Processing (VCIP)
DOI: 10.1109/vcip.2016.7805520
Highly parallel transformation and quantization for HEVC encoder on GPUs

Cited by 5 publications (2 citation statements)
References 6 publications
“…Regarding the migration of HEVC transform and quantization (TQ) to the GPU, in [11] two tables (one describing the transform unit (TU) partitioning, the other storing quantization parameter (QP) values) were proposed together with a CTU-level mapping algorithm to achieve an efficient implementation. In [12] the authors dealt with a heterogeneous system for an HEVC encoder in which motion-compensated prediction already resides on the GPU side and the TQ stage additionally has to be ported there. Parallel TU address list construction and coefficient packing were proposed to achieve high processing speed.…”
Section: Related Work and Motivation
confidence: 99%
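The TU address list construction mentioned above can be illustrated with a minimal sketch. This is not the authors' code; the input format (a flat list of `(x, y, size)` tuples describing a frame's TU partitioning) is an assumption for illustration. The idea is that grouping all of a frame's transform units by size lets each GPU kernel launch process a batch of uniformly shaped blocks:

```python
# Hypothetical sketch of per-size TU address list construction.
# `tus` is an assumed illustrative format: (x, y, size) per transform unit.
from collections import defaultdict

def build_tu_address_lists(tus):
    """Group TU (x, y) addresses by transform size (4, 8, 16, or 32).

    Each resulting list can be handed to one GPU kernel launch whose
    threads all work on transform blocks of the same dimensions.
    """
    lists = defaultdict(list)
    for x, y, size in tus:
        lists[size].append((x, y))
    return dict(lists)

# Example: a small frame region with mixed TU sizes.
tus = [(0, 0, 8), (8, 0, 8), (0, 8, 4), (4, 8, 4), (16, 0, 16)]
addr = build_tu_address_lists(tus)
```

Coefficient packing then follows the same principle: the coefficients of all TUs in one list are stored contiguously so that GPU memory accesses stay coalesced.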
“…Additionally, it has to be noted that a fair comparison using only the TQ processing time is not feasible. In [11], GPU acceleration is performed at the CTU block level, whereas in [12] and [14] it is performed at the frame level, with the transform blocks grouped beforehand for GPU acceleration. The latter approach allows much better use of GPU parallelism.…”
Section: Related Work and Motivation
confidence: 99%
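A back-of-envelope sketch shows why frame-level grouping exposes more parallelism than per-CTU launches. The counting below is a hypothetical illustration (assuming a frame fully tiled by TUs of one size, and 4x4 coefficient sub-blocks as the unit of GPU work), not a measurement from the cited papers:

```python
# Hypothetical parallelism estimate: 4x4 coefficient sub-blocks available
# to a single kernel launch at frame level vs. CTU level.
def work_items(frame_w, frame_h, tu_size, ctu_size=64, per_ctu=False):
    """Count 4x4 sub-blocks in one launch scope (illustrative model)."""
    area = ctu_size * ctu_size if per_ctu else frame_w * frame_h
    tus = area // (tu_size * tu_size)   # TUs covered by the launch
    return tus * (tu_size // 4) ** 2    # 4x4 sub-blocks per TU

# 1080p frame tiled with 8x8 TUs:
frame_level = work_items(1920, 1080, 8)               # whole frame at once
ctu_level = work_items(1920, 1080, 8, per_ctu=True)   # one 64x64 CTU at a time
```

Under these assumptions a frame-level launch offers hundreds of times more independent work items than a single-CTU launch, which is consistent with the citation statement's observation that frame-level grouping makes much better use of GPU parallelism.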