Avoiding Conversion and Rearrangement Overhead in SIMD Architectures

Shahbahrami, Asadollah; Juurlink, Ben; Borodin, Demid; Vassiliadis, S.

doi:10.1007/s10766-006-0015-0

Cited by 4 publications

(3 citation statements)

References 25 publications

“…This phase of the 2-D DWT is difficult to vectorize efficiently because the elements within a register need to be rearranged, incurring substantial overhead. Techniques we are considering include providing support for packed multiply-accumulate instructions for floating-point values (MMX/SSE provides such instructions but only for integer data) and the matrix register file (MRF) [23], which is a (micro-)architectural technique to efficiently support matrix transposition.…”

Section: Discussion (mentioning)

confidence: 99%

Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

Shahbahrami

Juurlink

Vassiliadis

2008

IEEE Trans. Multimedia

Self Cite

Abstract: The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying loop interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively.
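The loop interchange mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it applies a plain vertical FIR filter without the downsampling of a real DWT, and the function names `vertical_filter_column_order` and `vertical_filter_row_order` are hypothetical. It only shows how the two loop orders visit the same data and produce the same result; on a row-major image, the interchanged version walks each row contiguously.

```python
def vertical_filter_column_order(img, taps):
    """Straightforward order: finish one column before moving to the next.

    On a row-major image this strides through memory by one full row per step.
    """
    rows, cols = len(img), len(img[0])
    out_rows = rows - len(taps) + 1
    out = [[0.0] * cols for _ in range(out_rows)]
    for j in range(cols):              # outer loop over columns
        for i in range(out_rows):      # inner loop walks down the column
            out[i][j] = sum(t * img[i + k][j] for k, t in enumerate(taps))
    return out


def vertical_filter_row_order(img, taps):
    """Interchanged order: the innermost loop runs along a row (contiguous)."""
    rows, cols = len(img), len(img[0])
    out_rows = rows - len(taps) + 1
    out = [[0.0] * cols for _ in range(out_rows)]
    for i in range(out_rows):          # outer loop over output rows
        dst = out[i]
        for k, t in enumerate(taps):   # one input row per filter tap
            src = img[i + k]
            for j in range(cols):      # innermost loop is unit-stride
                dst[j] += t * src[j]
    return out
```

Both functions compute the same output; only the traversal order differs. As the abstract notes, the interchanged order still touches one input row per filter tap, which is where conflict misses can remain when the filter is long.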

“…We have evaluated our proposed techniques in a previous paper [Shahbahrami et al 2006a] using some 2-D multimedia kernels, such as 2-D discrete cosine transform (DCT) and its inverse (IDCT), Paeth prediction, 2 × 2 Haar transform and its inverse, vector/matrix multiplication, matrix transpose, and addition of two images. Figure 2 eliminates the matrix transposition step which is required in some kernels, for instance, 2-D (I)DCT and vector/matrix multiplication.…”

Section: Related Work (mentioning)

confidence: 99%

“…In addition, we discuss the new SIMD instructions and provide a preliminary evaluation of the hardware cost of the proposed techniques. More details about the MMMX architecture can be found in previous work [Shahbahrami et al 2006a, 2006b, 2006c].…”

Section: Architecture (mentioning)

confidence: 99%

Versatility of extended subwords and the matrix register file

Shahbahrami

Juurlink

Vassiliadis

2008

ACM Trans. Archit. Code Optim.

Self Cite

Extended subwords and the matrix register file (MRF) are two microarchitectural techniques that address some of the limitations of existing SIMD architectures. Extended subwords are wider than the data stored in memory. Specifically, for every byte of data stored in memory, there are four extra bits in the media register file. This avoids the need for data-type conversion instructions. The MRF is a register file organization that provides both conventional row-wise and column-wise access to the register file. In other words, it allows the register file to be viewed as a matrix in which corresponding subwords in different registers correspond to a column of the matrix. It was introduced to accelerate matrix transposition, which is a very common operation in multimedia applications. In this paper, we show that the MRF is very versatile, since it can also be used for permutations other than matrix transposition. Specifically, it is shown how it can be used to provide efficient access to strided data, as is needed in, e.g., color space conversion. Furthermore, it is shown that special-purpose instructions (SPIs), such as the sum-of-absolute-differences (SAD) instruction, have limited usefulness when extended subwords and a few general SIMD instructions that we propose are supported, for the following reasons. First, when extended subwords are supported, the SAD instruction provides only a relatively small performance improvement. Second, the SAD instruction processes 8-bit subwords only, which is not sufficient for quarter-pixel resolution nor for cost functions used in image and video retrieval. Results obtained by extending the SimpleScalar toolset show that the proposed techniques provide a speedup of up to 3.00 over the MMX architecture. The results also show that using, at most, 13 extra media registers yields an additional performance improvement ranging from 1.38 to 1.57.
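The row-wise and column-wise views described in the abstract can be pictured with a small software model. This is a toy sketch under stated assumptions, not the paper's instruction set: the class `MatrixRegisterFile` and the helper `transpose_via_columns` are hypothetical names, the register file is modelled as n registers of n subwords each, and no extended-subword width or instruction timing is modelled.

```python
class MatrixRegisterFile:
    """Toy model of a register file that can be read by row or by column."""

    def __init__(self, n):
        self.n = n
        # regs[r][c] is subword c of register r.
        self.regs = [[0] * n for _ in range(n)]

    def write_row(self, r, values):
        """Conventional access: fill one register with n subwords."""
        if len(values) != self.n:
            raise ValueError("expected %d subwords" % self.n)
        self.regs[r] = list(values)

    def read_row(self, r):
        """Conventional access: all subwords of register r."""
        return list(self.regs[r])

    def read_column(self, c):
        """Column-wise access: subword c taken from every register."""
        return [self.regs[r][c] for r in range(self.n)]


def transpose_via_columns(block):
    """Transpose an n x n block by writing rows and reading columns back."""
    n = len(block)
    mrf = MatrixRegisterFile(n)
    for r, row in enumerate(block):
        mrf.write_row(r, row)
    return [mrf.read_column(c) for c in range(n)]
```

In this model, loading a block row by row and reading it back column by column yields the transposed block without any explicit shuffle or unpack steps, which is the effect the abstract attributes to the MRF.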

Scalar Processing Overhead on SIMD-Only Architectures

Filho

Juurlink

2009

2009 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors
