Vectorizing for a SIMdD DSP architecture

Naishlos, Dorit; Biberstein, Marina; Ben-David, Shay; Zaks, Ayal

doi:10.1145/951713.951714

Cited by 11 publications

(19 citation statements)

References 0 publications


“…The SIMdD (Single Instruction Multiple disjoint Data) architecture contains the multi-port vector memory unit which allows accessing disjoint data with no overhead [8]. Although the SIMdD shows a good performance, it is based on the costly multi-port memory.…”

Section: Related Work (mentioning)

confidence: 99%

Compiler-Based Performance Evaluation of an SIMD Processor with a Multi-Bank Memory Unit

Chang

Cho

Sung

2008

J Sign Process Syst Sign Image Video Technol


The single instruction multiple data (SIMD) architecture is very efficient for executing arithmetic-intensive programs, but frequently suffers from data-alignment problems. The data-alignment problem not only induces extra time overhead but also hinders automatic vectorization by the SIMD compiler. In this paper, we compare three on-chip memory systems for the SIMD architecture (single-bank, multi-bank, and multi-port) as ways to resolve the data-alignment problems. The single-bank memory is the simplest, but supports only aligned accesses. The multi-bank memory is slightly more complex, but enables unaligned accesses and stride accesses, subject to a bank-conflict limitation. The multi-port memory is capable of both unaligned and stride accesses without any restriction, but requires considerably more expensive hardware. We also developed a vectorizing compiler that can conduct dynamic memory allocation and SIMD code generation. The performance of the three memory systems with our SIMD compiler is evaluated using several digital signal processing kernels and the MPEG2 encoder. The experimental results show that the multi-bank memory can carry out MPEG2 encoding 5.8 times faster, whereas the single-bank memory achieves only a 2.9 times speed-up, when employed in a multimedia system with a 2-issue host processor and an 8-way SIMD coprocessor. The multi-port memory shows the best performance, but its improvement over the multi-bank memory is impractical once hardware cost is considered.
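The bank-conflict limitation mentioned in this abstract can be illustrated with a minimal sketch. It assumes low-order interleaving (element address `a` lives in bank `a % num_banks`) and an 8-lane access over 8 banks; the abstract does not state the actual mapping, and the function names here are hypothetical.

```python
def banks_touched(base, stride, lanes=8, num_banks=8):
    """Bank index hit by each lane of a strided vector access.

    Assumption (not from the abstract): element address a maps to
    bank a % num_banks, i.e. simple low-order interleaving.
    """
    return [(base + lane * stride) % num_banks for lane in range(lanes)]


def is_conflict_free(base, stride, lanes=8, num_banks=8):
    """True when every lane reads a different bank, so one cycle suffices."""
    banks = banks_touched(base, stride, lanes, num_banks)
    return len(set(banks)) == len(banks)


if __name__ == "__main__":
    # Unaligned unit-stride access: all eight banks are distinct.
    print(is_conflict_free(base=3, stride=1))
    # Stride 2 over 8 banks: lanes 0 and 4 collide in the same bank.
    print(is_conflict_free(base=0, stride=2))
```

Under this assumed mapping, any unit-stride access is conflict-free regardless of alignment, while strides that share a factor with the bank count collide. A single-bank memory would instead accept only accesses whose base is a multiple of the vector width.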



“…A different approach to eliminate data-permutation instructions named single-instruction multiple disjoint data (SIMdD) has been proposed in the eLite DSP architecture [Moreno et al. 2003; Naishlos et al. 2003]. Instead of a vector register file, the eLite DSP employs a large scalar register file, the vector element file (VEF).…”

Section: Related Work (mentioning)

confidence: 99%

Versatility of extended subwords and the matrix register file

Shahbahrami

Juurlink

Vassiliadis

2008

ACM Trans. Archit. Code Optim.


Extended subwords and the matrix register file (MRF) are two microarchitectural techniques that address some of the limitations of existing SIMD architectures. Extended subwords are wider than the data stored in memory. Specifically, for every byte of data stored in memory, there are four extra bits in the media register file. This avoids the need for data-type conversion instructions. The MRF is a register file organization that provides both conventional row-wise and column-wise access to the register file. In other words, it allows the register file to be viewed as a matrix in which corresponding subwords in different registers correspond to a column of the matrix. It was introduced to accelerate matrix transposition, which is a very common operation in multimedia applications. In this paper, we show that the MRF is very versatile, since it can also be used for permutations other than matrix transposition. Specifically, it is shown how it can be used to provide efficient access to strided data, as is needed in, e.g., color space conversion. Furthermore, it is shown that special-purpose instructions (SPIs), such as the sum-of-absolute differences (SAD) instruction, have limited usefulness when extended subwords and a few general SIMD instructions that we propose are supported, for the following reasons. First, when extended subwords are supported, the SAD instruction provides only a relatively small performance improvement. Second, the SAD instruction processes 8-bit subwords only, which is sufficient neither for quarter-pixel resolution nor for cost functions used in image and video retrieval. Results obtained by extending the SimpleScalar toolset show that the proposed techniques provide a speedup of up to 3.00 over the MMX architecture. The results also show that using, at most, 13 extra media registers yields an additional performance improvement ranging from 1.38 to 1.57.
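The row-wise and column-wise access described in this abstract can be sketched as a toy software model. The class and method names below are hypothetical and do not correspond to the instructions proposed in the paper; the sketch only shows why column reads make transposition cheap.

```python
class MatrixRegisterFile:
    """Toy model: n registers of n subwords, readable by row or by column."""

    def __init__(self, n):
        self.n = n
        self.regs = [[0] * n for _ in range(n)]

    def write_row(self, r, values):
        # Conventional access: fill one whole register.
        assert len(values) == self.n
        self.regs[r] = list(values)

    def read_row(self, r):
        return list(self.regs[r])

    def read_column(self, c):
        # Column access: subword c of every register, gathered in one read.
        return [self.regs[r][c] for r in range(self.n)]


def transpose_via_mrf(matrix):
    """Write rows into the register file, then read the columns back."""
    n = len(matrix)
    mrf = MatrixRegisterFile(n)
    for r, row in enumerate(matrix):
        mrf.write_row(r, row)
    return [mrf.read_column(c) for c in range(n)]


if __name__ == "__main__":
    print(transpose_via_mrf([[1, 2], [3, 4]]))  # [[1, 3], [2, 4]]
```

In this model a transposition needs n row writes and n column reads, with no separate permutation steps. The same column read also gathers elements that sit at a fixed stride across registers, which is the strided-access use the abstract refers to.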


“…The FIR filter was selected for several reasons. First, it can be vectorized in a number of ways, for example with either inner-loop or outer-loop vectorization, on different SIMD platforms [10,16,25,8]. Figure 4 demonstrates the difference between inner-loop vectorization (regular vectorization of the innermost loop) and outer-loop vectorization, using a vector size of 4.…”

Section: Benchmark Description (mentioning)

confidence: 99%
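The difference between the two vectorization strategies named in the statement above can be sketched in plain Python, with length-4 lists standing in for vector registers. This is an illustrative sketch, not a reproduction of the cited paper's Figure 4; the function names are hypothetical, and the tap count and output count are assumed to be multiples of the vector size.

```python
VF = 4  # vector size used in the quoted statement


def fir_scalar(x, c, n):
    """Reference FIR: y[i] = sum_j c[j] * x[i + j]."""
    return [sum(c[j] * x[i + j] for j in range(len(c))) for i in range(n)]


def fir_inner(x, c, n):
    """Inner-loop vectorization: the tap loop j is processed VF at a time."""
    taps = len(c)
    assert taps % VF == 0  # assumption: no remainder loop in this sketch
    y = []
    for i in range(n):
        acc = [0] * VF  # VF partial sums for a single output
        for j in range(0, taps, VF):
            xv = x[i + j:i + j + VF]
            cv = c[j:j + VF]
            acc = [a + p * q for a, p, q in zip(acc, xv, cv)]
        y.append(sum(acc))  # horizontal reduction once per output
    return y


def fir_outer(x, c, n):
    """Outer-loop vectorization: VF consecutive outputs computed together."""
    assert n % VF == 0  # assumption: no remainder loop in this sketch
    y = []
    for i in range(0, n, VF):
        acc = [0] * VF  # one accumulator lane per output y[i..i+VF-1]
        for j in range(len(c)):
            xv = x[i + j:i + j + VF]  # window shifted by j, often unaligned
            acc = [a + c[j] * p for a, p in zip(acc, xv)]  # c[j] broadcast
        y.extend(acc)  # no reduction needed
    return y


if __name__ == "__main__":
    x = list(range(1, 12))  # needs at least n + taps - 1 samples
    c = [1, 2, 3, 4]
    print(fir_inner(x, c, 8) == fir_outer(x, c, 8) == fir_scalar(x, c, 8))
```

The inner-loop form pays for a horizontal reduction per output, while the outer-loop form trades that for input windows that start at every offset, which is where alignment and data-reuse support in the target architecture matters.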

“…The compiler can then arrange to exploit this data reuse by computing the overall range of data that is being accessed throughout the execution of a loop, and preload it in advance into the vector register file, thereby making sure that all elements in that range are loaded exactly once. This data can then be accessed indirectly via the vector maps of iVMX, updated to index "sliding windows" of vector registers, similarly to the eLite architecture with its vector pointers [16].…”

Section: Compiler Optimizations (mentioning)

confidence: 99%
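The preload-once, sliding-window idea in the statement above can be sketched at element granularity, loosely in the spirit of index-based operand selection. This is a software analogy only: the `element_file` list, the `window` index list, and the load counter are hypothetical and do not model the actual iVMX map format or the eLite vector pointers.

```python
VF = 4  # assumed vector size


def fir_with_sliding_window(x, c, n):
    """FIR over a preloaded range, with operands selected by a sliding index window.

    Returns (outputs, number of element loads from memory).
    """
    taps = len(c)
    span = n + taps - 1  # overall range of x touched by the whole loop
    assert n % VF == 0 and len(x) >= span

    element_file = list(x[:span])  # preload: every element loaded exactly once
    loads = span

    y = []
    for i in range(0, n, VF):
        window = list(range(i, i + VF))  # indices of the current operand
        acc = [0] * VF
        for j in range(taps):
            operand = [element_file[k] for k in window]  # indirect access
            acc = [a + c[j] * v for a, v in zip(acc, operand)]
            window = [k + 1 for k in window]  # slide the window, no reload
        y.extend(acc)
    return y, loads


if __name__ == "__main__":
    x = list(range(1, 12))
    c = [1, 2, 3, 4]
    y, loads = fir_with_sliding_window(x, c, 8)
    print(y, loads)
```

In this sketch the loop performs `n + taps - 1` element loads in total, whereas reloading a fresh `VF`-element window for every tap of every output group would perform `n * taps` element loads.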


Compiling for an indirect vector register architecture

Nuzman

Namolaru

Zaks

et al. 2008

Proceedings of the 5th Conference on Computing Frontiers

Self Cite


The iVMX architecture contains a novel vector register file of up to 4096 vector registers accessed indirectly via a mapping mechanism, providing compatibility with the VMX architecture, and potential for dramatic performance benefits [7]. The large number of vector registers and the unique indirection mechanism pose compilation challenges for efficient use: the indirection mechanism emphasizes spatial locality of registers and interaction among destination and source operands during register allocation, and the many vector registers call for aggressive automatic vectorization. This work is a first step in addressing the compilability of iVMX, following the presentation and validation of its architectural aspects [7]. In this paper we present several compilation approaches to deal with the mapping mechanism and an outer-loop vectorization transformation developed to promote the use of many vector registers. We modified an existing register allocator to target all available registers and added a post-pass to rename live-ranges considering spatial locality and interaction among operand types. An FIR filter is used to demonstrate the effectiveness of the techniques developed compared to a version hand-optimized for iVMX. Initial results show that we can reduce the overhead of map management down to 29% of the total instruction count, compared to 22% obtained manually, and compared to 49% obtained using a naive scheme, while outperforming an equivalent VMX implementation by a factor of 2.

