Wavelet Transform for Large Scale Image Processing on Modern Microprocessors

Chaver, Daniel; Tenllado, Christian; Piñuel, Luis; Prieto, Manuel; Tirado, Francisco

doi:10.1007/3-540-36569-9_37

Cited by 8 publications

(10 citation statements)

References 10 publications

(23 reference statements)

“…The problem with 4D layout is common to all segmentation methods: we have to figure out the boundary handling. In [4] the proposed method for convolution approach is to store boundary coefficients in special side buffers. For fast lifting implementation the solution is slightly different, as we need to buffer only one value on each side, but we have to resynchronize those buffers every lifting step because of the data dependency (Fig.…”

Section: Memory Layout Problems (mentioning)

confidence: 99%

Lifting-based wavelet transform for images on modern CPU architectures

Maly

Rajmic

2008

2008 International Conference on Signals and Electronic Systems

This article analyzes performance of various 2D wavelet transform implementations based on the lifting scheme for the CDF 9/7 filterbank with respect to the modern and most recent CPU architectures. We propose three methods combining different approaches to memory locality handling and parallelism, which is obtained using Intel Threading Building Blocks developer libraries. Implementations were tested on a wide range of Intel-based personal computers.

Index Terms: fast lifting wavelet transform, parallel optimization, threading building blocks, 4D layout

I. INTRODUCTION

Two-dimensional wavelet transform is an important tool in many image processing applications. Speaking of digital still image lossy compression, the Cohen-Daubechies-Feauveau 9/7 wavelet (CDF 9/7) is considered as the best solution concerning the classic dyadic approach [10] and hence it is used in JPEG2000 standard. Regarding the length of the filters, the standard computational approach that uses convolution with FIR filter bank structures wastes too many computational and memory resources. Another mathematical formulation called lifting-based wavelet transform and its efficient factorization has been proposed [5], requiring far fewer computations to obtain the same result as the standard algorithm. The Fast lifting wavelet transform is based on breaking the original filters to a series of so-called lifting steps, a sequence of upper and lower triangular matrices, which are applied to even and odd parts of the original signal. The algorithm is processed in-place, requiring no extra memory for each step, with the exception of temporary buffers used for coefficient ordering. Moreover, due to the reduced complexity of the steps, computational time can be reduced as far as 50% of the original convolution approach duration, as documented by [8].

Many specific implementations of the fast lifting transform algorithm are described in literature [1], [2]. In this article we focus on modern CPU architectures. Most recent processors tend to have large caching facilities and often employ methods of parallelism, and we try to exploit these facts by proposing some specific approaches of the transform.
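The lifting steps described in the abstract above can be illustrated with a minimal one-dimensional sketch. This is not the cited authors' implementation: the function names (`_lift`, `forward_97`, `inverse_97`) are hypothetical, the constants are the commonly quoted values for the CDF 9/7 lifting factorization, and the mirrored boundary handling and the scaling convention are assumptions that differ between implementations.

```python
# Commonly quoted CDF 9/7 lifting constants (assumed here, not taken from the cited paper).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
K = 1.149604398  # scaling factor; the convention used below is an assumption


def _lift(x, start, coeff):
    """One in-place lifting step on every second sample, beginning at `start`.

    Each touched sample gets coeff * (left neighbour + right neighbour) added.
    A neighbour that falls outside the signal is mirrored from the other side.
    """
    n = len(x)
    for i in range(start, n, 2):
        left = x[i - 1] if i - 1 >= 0 else x[i + 1]
        right = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] += coeff * (left + right)


def forward_97(signal):
    """Return (approximation, detail) coefficients for a signal of length >= 2."""
    if len(signal) < 2:
        raise ValueError("signal must contain at least two samples")
    x = [float(v) for v in signal]
    _lift(x, 1, ALPHA)  # first step on the odd samples
    _lift(x, 0, BETA)   # first step on the even samples
    _lift(x, 1, GAMMA)  # second step on the odd samples
    _lift(x, 0, DELTA)  # second step on the even samples
    approx = [v / K for v in x[0::2]]  # de-interleave and scale
    detail = [v * K for v in x[1::2]]
    return approx, detail


def inverse_97(approx, detail):
    """Undo forward_97 by running the same steps backwards with negated constants."""
    x = [0.0] * (len(approx) + len(detail))
    x[0::2] = [v * K for v in approx]  # undo scaling and re-interleave
    x[1::2] = [v / K for v in detail]
    _lift(x, 0, -DELTA)
    _lift(x, 1, -GAMMA)
    _lift(x, 0, -BETA)
    _lift(x, 1, -ALPHA)
    return x
```

Because each step changes only the samples of one parity, using the untouched samples of the other parity, the transform runs in place and the inverse is simply the same steps in reverse order with the signs flipped. The only extra memory in this sketch is the de-interleaving into `approx` and `detail`, which corresponds to the coefficient-ordering buffers the abstract mentions.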

“…SIMD-vectorized wavelet transforms are presented in [11] and a SIMD-vectorized FFT library is presented in [45].…”

Section: B. Vectorizing Codes for Short Vector SIMD Extensions (mentioning)

confidence: 99%

Efficient Utilization of SIMD Extensions

et al. 2005

Abstract: This paper targets automatic performance tuning of numerical kernels in the presence of multi-layered memory hierarchies and SIMD parallelism. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions.

FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine's general purpose C compiler to maintain portability.

The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and thus inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes (i) symbolic vectorization of DSP transforms, (ii) straight-line code vectorization for numerical kernels, and (iii) compiler backends for straight-line code with vector instructions.

Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speed-ups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.

“…In [6] we have extended these previous studies with a more detailed analysis based on hardware performance counters and a study of the vectorization on an Intel P-III microprocessor.…”

Section: Related Work (mentioning)

confidence: 99%

2-D Wavelet Transform Enhancement on General-Purpose Microprocessors: Memory Hierarchy and SIMD Parallelism Exploitation

Chaver

Tenllado

Piñuel

et al. 2002

High Performance Computing — HiPC 2002

Self Cite

This paper addresses the implementation of a 2-D Discrete Wavelet Transform on general-purpose microprocessors, focusing on both memory hierarchy and SIMD parallelization issues. Both topics are somewhat related, since SIMD extensions are only useful if the memory hierarchy is efficiently exploited. In this work, locality has been significantly improved by means of a novel approach called pipelined computation, which complements previous techniques based on loop tiling and non-linear layouts. As experimental platforms we have employed a Pentium-III (P-III) and a Pentium-4 (P-4) microprocessor. However, our SIMD-oriented tuning has been exclusively performed at source code level. Basically, we have reordered some loops and introduced some modifications that allow automatic vectorization. Taking into account the abstraction level at which the optimizations are carried out, the speedups obtained on the investigated platforms are quite satisfactory, even though further improvement can be obtained by dropping the level of abstraction (compiler intrinsics or assembly code).
