Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions

Bernabé, Gregorio; García, José M.; González, José María Faci

doi:10.1007/s11265-005-6651-6

Cited by 15 publications

(4 citation statements)

References 32 publications

(33 reference statements)

“…Today, the standard MPEG-4 [23][24] supports an ad-hoc tool for encoding textures and still images, based on a wavelet algorithm. In a previous work [25], we have presented the implementation of a lossy encoder for medical video based on the 3D fast wavelet transform. This encoder achieves both high compression ratios and excellent quality, so that medical doctors can not find longer differences between the original and the reconstructed video.…”

Section: Previous Work

mentioning

confidence: 99%

The 2D wavelet transform on emerging architectures: GPUs and multicores

Franco, Bernabé, Fernández, et al. 2011

J Real-Time Image Proc

Self Cite

We present in this paper several implementations of the 3D Fast Wavelet Transform (3D-FWT) on multicore CPUs and manycore GPUs. On the GPU side, we focus on CUDA and OpenCL programming to develop methods for an efficient mapping on manycores. On multicore CPUs, OpenMP and Pthreads are used as counterparts to maximize parallelism, and renowned techniques like tiling and blocking are exploited to optimize the use of memory. We evaluate these proposals and make a comparison between a new Fermi Tesla C2050 and an Intel Core 2 Quad Q6700. Speedups of the CUDA version are the best results, improving the execution times on CPU, ranging from 5.3x to 7.4x for different image sizes, and up to 81 times faster when communications are neglected. Meanwhile, OpenCL obtains solid gains which range from 2x factors on small frame sizes to 3x factors on larger ones.

“…Meerwald et al [12], Bernabe et al [13], Chrysafis and Ortega [14] and Lafruit et al [15] present different memory-optimized execution orders or localizations of the WT, offering various methods to avoid off-chip misses: [12] reduces conflict misses in the vertical WT filtering by modifying the data layout and improves the spatial locality by modifying the execution order. Bernabe et al [13] reduces the cache misses during vertical filtering by computing tiles of merged horizontal and vertical filtering, [14] further avoids misses during the higher WT levels by merging lines of computation over all the WT levels, while [15] offers the same advantages, but by merging in a block-based manner, which corresponds well to further processing blocks.…”

Section: Related Work

mentioning

confidence: 99%

“…Bernabe et al [13] reduces the cache misses during vertical filtering by computing tiles of merged horizontal and vertical filtering, [14] further avoids misses during the higher WT levels by merging lines of computation over all the WT levels, while [15] offers the same advantages, but by merging in a block-based manner, which corresponds well to further processing blocks. Chaver et al [16] finally realizes a trade-off between the in-placing freedom and spatial locality present in certain implementation styles of the WT.…”

Section: Related Work

mentioning

confidence: 99%

Exploiting Varying Resource Requirements in Wavelet-based Applications in Dynamic Execution Environments

Geelen, Ferentinos, Catthoor, et al. 2008

J Sign Process Syst Sign Image Video Technol

In the context of future dynamic applications, systems will exhibit unpredictably varying platform resource requirements. To deal with this, they will not only need to be programmable in terms of instruction set processors, but also at least partial reconfigurability will be required. In this context, it is important for applications to optimally exploit the memory hierarchy under varying memory availability. This article presents a mapping strategy for wavelet-based applications: depending on the encountered conditions, it switches to different memory optimized instantiations or localizations, permitting up to 51% energy gains in memory accesses. Systematic and parameterized mapping guidelines indicate which localization should be selected when, for varying algorithmic wavelet parameters. The results have been formalized and generalized to be applicable to more general wavelet-based applications.
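
The citation statements above credit Bernabé et al. with reducing cache misses by computing tiles of merged horizontal and vertical filtering. The following is a minimal sketch of that access pattern, not the authors' implementation. It assumes a one-level 2D Haar filter (chosen here only because its 2-tap, non-overlapping support lets tiles be processed independently), even image dimensions, and an even tile size. The function names `haar_pair`, `haar2d_separate`, and `haar2d_tiled` are hypothetical. Longer filters would need extra samples across tile borders, and pure Python illustrates only the traversal order, not the cache behaviour.

```python
def haar_pair(a, b):
    """One Haar step on two samples: (average, half-difference)."""
    return (a + b) / 2.0, (a - b) / 2.0


def haar2d_separate(img):
    """Reference order: full horizontal pass, then full vertical pass."""
    rows, cols = len(img), len(img[0])
    half_r, half_c = rows // 2, cols // 2
    # Horizontal filtering of every row: low band left, high band right.
    tmp = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(half_c):
            lo, hi = haar_pair(img[r][2 * c], img[r][2 * c + 1])
            tmp[r][c], tmp[r][half_c + c] = lo, hi
    # Vertical filtering walks down whole columns (the cache-unfriendly pass).
    out = [[0.0] * cols for _ in range(rows)]
    for c in range(cols):
        for r in range(half_r):
            lo, hi = haar_pair(tmp[2 * r][c], tmp[2 * r + 1][c])
            out[r][c], out[half_r + r][c] = lo, hi
    return out


def haar2d_tiled(img, tile=4):
    """Merged horizontal + vertical filtering, one tile at a time."""
    rows, cols = len(img), len(img[0])
    # Assumption of this sketch: even sizes so 2x2 blocks never straddle tiles.
    assert rows % 2 == 0 and cols % 2 == 0 and tile % 2 == 0
    half_r, half_c = rows // 2, cols // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            for r in range(r0, min(r0 + tile, rows), 2):
                for c in range(c0, min(c0 + tile, cols), 2):
                    # Horizontal step on the two rows of a 2x2 block.
                    lo_top, hi_top = haar_pair(img[r][c], img[r][c + 1])
                    lo_bot, hi_bot = haar_pair(img[r + 1][c], img[r + 1][c + 1])
                    # Vertical step right away, while the samples are still local.
                    ll, lh = haar_pair(lo_top, lo_bot)
                    hl, hh = haar_pair(hi_top, hi_bot)
                    i, j = r // 2, c // 2
                    out[i][j] = ll
                    out[half_r + i][j] = lh
                    out[i][half_c + j] = hl
                    out[half_r + i][half_c + j] = hh
    return out


# Usage: both traversal orders give the same subbands on an 8x8 test image.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
assert haar2d_tiled(image, tile=4) == haar2d_separate(image)
```

The tiled version never materialises the intermediate row-filtered image, so each input sample is read once while its tile is being processed. That is the property the citing authors describe as avoiding misses during vertical filtering.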

“…SIMD extensions of the instruction sets are successfully used in 3D as well. For example, [41] describes the SIMD optimization of 3D wavelet transform. Tomographic reconstruction can be sped up by a factor of 3 using SSE, as described in [42].…”

mentioning

confidence: 99%

Optimizing Gaussian filtering of volumetric data using SSE

Vaško, Šrámek 2010

Concurrency and Computation

Summary: Gaussian filtering is a basic operation commonly used in numerous image and volume processing algorithms. It is, therefore, desirable to perform it as efficiently as possible. Over the last decade CPUs have been successfully extended with several SIMD (Single Instruction Multiple Data) extensions, such as MMX, 3DNow!, and the SSE series. In this paper we introduce a new technique for Gaussian filtering of volume data sets (the extended volume), together with its SIMD implementation using the SSE technology. We further introduce a SIMD-optimized recursive IIR implementation of the Gaussian filter, and finally, we parallelize the SSE versions with the help of OpenMP (Open Multi-Processing). Experimental evaluation indicates that the SIMD implementation can significantly speed up both versions of the Gaussian filtering and that the non-recursive extended volume version is faster than the recursive IIR one for small widths of the Gaussian filter.
