Proceedings of the 2014 Conference on Design and Architectures for Signal and Image Processing (DASIP)
DOI: 10.1109/dasip.2014.7115637

CUVLE: Variable-length encoding on CUDA

Abstract: Data compression is the process of representing information in a compact form, in order to reduce the storage requirements and, hence, the communication bandwidth. It has been one of the critical enabling technologies for the ongoing digital multimedia revolution for decades. In the variable-length encoding (VLE) compression method, the most frequently occurring symbols are replaced by codes with shorter lengths. As it is a common strategy in many compression applications, efficient parallel implementations o…
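The core idea of VLE described in the abstract can be sketched in a few lines. This is an illustrative example only, with hypothetical symbols and codewords (not taken from the paper): the most frequent symbol receives the shortest code.

```python
# Minimal variable-length encoding sketch: each symbol is replaced by
# its codeword; frequent symbols get shorter codewords, so the output
# bit string is shorter than a fixed-length encoding.
def encode(data, table):
    """Concatenate each symbol's variable-length codeword."""
    return "".join(table[s] for s in data)

data = "aaaabbc"
# Hypothetical prefix code chosen by frequency: 'a' is most frequent.
table = {"a": "0", "b": "10", "c": "11"}
bits = encode(data, table)
print(bits)  # -> "0000101011" (10 bits vs. 14 for a 2-bit fixed code)
```

With a 2-bit fixed-length code the same input would need 14 bits, so even this tiny prefix code saves space.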


Cited by 8 publications (5 citation statements)
References 6 publications
“…Since the prefix‐sums can be computed efficiently in parallel, Huffman encoding can also be done in parallel. Several GPU implementations for Huffman encoding using this idea have been presented.12,32 On the other hand, Huffman decoding is very hard to parallelize, because codeword sequence Y has no separator and each codeword cannot be identified without reading bits ahead of it.…”
Section: Deflate Encoding and Decoding
confidence: 99%
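The prefix-sum idea behind parallel encoding mentioned in this citation can be illustrated with a sketch. This is sequential Python standing in for a GPU scan, with hypothetical symbols and codes: the output bit offset of each symbol is the exclusive prefix sum of the codeword lengths before it, so once the scan is done every codeword can be written independently by its own thread.

```python
# Exclusive prefix-sum (scan) of codeword lengths: symbol i's output
# bit offset is the total length of all codewords before it. On a GPU
# this scan is the only cross-symbol dependency in the encoder.
from itertools import accumulate

def bit_offsets(symbols, table):
    lengths = [len(table[s]) for s in symbols]
    # Exclusive scan: offset[i] = lengths[0] + ... + lengths[i-1].
    return [0] + list(accumulate(lengths))[:-1]

table = {"a": "0", "b": "10", "c": "11"}  # hypothetical prefix code
offsets = bit_offsets("abacab", table)
print(offsets)  # -> [0, 1, 3, 4, 6, 7]
```

After the scan, each thread knows exactly where its codeword's bits land, which is why encoding parallelizes cleanly while decoding does not.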
“…Several GPU implementations for Huffman encoding using this idea have been presented. 12,32 On the other hand, Huffman decoding is very hard to parallelize, because codeword sequence Y has no separator and each codeword cannot be identified without reading bits ahead of it. Hence, a parallel divide-and-conquer approach that decodes Y from the middle of Y does not work.…”
Section: 2
confidence: 99%
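The decoding difficulty this citation describes can be seen in a sketch (hypothetical code table, not from the paper): a prefix-code bit stream has no separators, so codeword boundaries are only discovered by scanning left to right, and a thread cannot simply start decoding from the middle of the stream.

```python
# Sequential prefix-code decoder: boundaries emerge only as the bits
# are consumed in order, which is why divide-and-conquer decoding
# starting mid-stream does not work without extra bookkeeping.
def decode(bits, table):
    inverse = {code: sym for sym, code in table.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:        # a boundary is found only here
            out.append(inverse[cur])
            cur = ""
    return "".join(out)

table = {"a": "0", "b": "10", "c": "11"}  # hypothetical prefix code
print(decode("0000101011", table))  # -> "aaaabbc"
```

Starting at, say, bit 5 of the same stream would misinterpret the suffix, because the decoder cannot know whether bit 5 begins a codeword without having decoded bits 0-4 first.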
“…As in our previous works [7,8], the thread-block synchronization mechanism proposed by Yan et al [47] is used for synchronizing the reads with the writes in global memory. In this case, it is applied on both horizontal (d_info_A) and vertical (d_info_B) dimensions and the reads are performed using atomic operations.…”
Section: Fig 14 Transmission Of Parameter Nb Through Global Memory
confidence: 99%
“…Fuentes-Alventosa et al [47] proposed a GPU implementation of Huffman coding using CUDA with a given table of variable-length codes, which improves the performance by more than 20× compared with a serial CPU implementation. Rahmani et al [48] proposed a CUDA implementation of Huffman coding based on serially constructing the Huffman codeword tree and generating the byte stream in parallel, which can achieve up to 22× speedups compared with a serial CPU implementation without any constraint on the maximum codeword length or data entropy.…”
Section: Huffman Coding On GPU
confidence: 99%