A Review of SIMD Multimedia Extensions and their Usage in Scientific and Engineering Applications

Hassaballah, M.; Omran, Saleh; Mahdy, Yousef B.

doi:10.1093/comjnl/bxm099

Cited by 39 publications

(21 citation statements)

References 33 publications


“…Existing solutions for multimedia and matrix operations mostly focus on 1D [15] and 2D [3] arrangements of processing elements [16]. In this section, we use matrix multiplication to give a perspective of differences between 1D and 2D arrangement of PEs for matrix computations.…”

Section: A. 1D vs 2D Architectures (mentioning)

confidence: 99%

“…Three main limitations of conventional 1D vector architectures are known to be complexity of the central register file, implementation difficulties of precise exception handling, and expensive on-chip memory [20]. A detailed review of SIMD multimedia extensions and their bottlenecks are presented in [15], [39]. Associated costs are amplified by the fact that in each step a complete vector has to be transferred through multiple ports of a register file, wide wires, and complex point-to-point interconnects such as crossbars.…”

Section: B. Related Work (mentioning)

confidence: 99%


On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators

Pedram

Gerstlauer

Geijn

2012

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing

Abstract: Reducing power consumption and increasing efficiency is a key concern for many applications. How to design highly efficient computing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present how broadcast buses can eliminate the use of power hungry multi-ported register files in the context of data-parallel hardware accelerators for linear algebra operations. We demonstrate an algorithm/architecture co-design for the mapping of different collective communication operations, which are crucial for achieving performance and efficiency in most linear algebra routines, such as GEMM, SYRK and matrix transposition. We compare a broadcast bus based architecture with conventional SIMD, 2D-SIMD and flat register file for these operations in terms of area and energy efficiency. Results show that fast broadcast data movement abilities in a prototypical linear algebra core can achieve up to 75x better power and up to 10x better area efficiency compared to traditional SIMD architectures.


“…All major vendors support vector instructions and the trend is pushing them to become wider and more powerful [1]. SIMD instruction set extensions are quite common today in both high performance and embedded microprocessors [2]. However, writing code that makes efficient use of these units and leads to platform-specific implementations is rather difficult [3].…”

Section: Introduction (mentioning)

confidence: 99%

Insufficient Vectorization: A New Method to Exploit Superword Level Parallelism

Gao

Lin

Zhao

et al.

2017

IEICE Trans. Inf. & Syst.

Summary: Single-instruction multiple-data (SIMD) extension provides an energy-efficient platform to scale the performance of media and scientific applications while still retaining post-programmability. However, the major challenge is to translate the parallel resources of the SIMD hardware into real application performance. Currently, all the slots in the vector register are used when compilers exploit SIMD parallelism of programs, which can be called sufficient vectorization. Sufficient vectorization means all the data in the vector register is valid. Because all the slots which vector register provides must be used, the chances of vectorizing programs with low SIMD parallelism are abandoned by sufficient vectorization method. In addition, the speedup obtained by full use of vector register sometimes is not as great as that obtained by partial use. Specifically, the length of vector register provided by SIMD extension becomes longer, sufficient vectorization method cannot exploit the SIMD parallelism of programs completely. Therefore, insufficient vectorization method is proposed, which refer to partial use of vector register. First, the adaptation scene of insufficient vectorization is analyzed. Second, the methods of computing inter-iteration and intra-iteration SIMD parallelism for loops are put forward. Furthermore, according to the relationship between the parallelism and vector factor a method is established to make the choice of vectorization method, in order to vectorize programs as well as possible. Finally, code generation strategy for insufficient vectorization is presented. Benchmark test results show that insufficient vectorization method vectorized more programs than sufficient vectorization method by 107.5% and the performance achieved by insufficient vectorization method is 12.1% higher than that achieved by sufficient vectorization method.

Key words: SIMD extension, SIMD parallelism, vector register, insufficient vectorization
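The idea of partial use of a vector register, as described in the summary above, can be illustrated with a minimal C sketch. This is not the paper's code generation strategy; it is only an assumed illustration in which a 4-lane SSE register carries fewer than four valid elements and the unused lanes are padded with zeros. The function name `add_partial` and the macro `LANES` are hypothetical, and a scalar fallback is used when SSE is unavailable.

```c
#include <assert.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics, available on x86 targets */
#endif

#define LANES 4  /* hypothetical name: floats per 128-bit register */

/* Computes out[i] = a[i] + b[i] for i < n, where n may be smaller than
 * LANES. Only n lanes hold valid data; the rest are zero padding whose
 * results are discarded. */
void add_partial(const float *a, const float *b, float *out, int n)
{
    float pa[LANES] = {0.0f};
    float pb[LANES] = {0.0f};
    float po[LANES];

    assert(n >= 0 && n <= LANES);

    /* Copy the n valid elements into padded, register-sized buffers. */
    memcpy(pa, a, (size_t)n * sizeof(float));
    memcpy(pb, b, (size_t)n * sizeof(float));

#if defined(__SSE__)
    /* One packed add over all four lanes, valid or not. */
    _mm_storeu_ps(po, _mm_add_ps(_mm_loadu_ps(pa), _mm_loadu_ps(pb)));
#else
    /* Scalar fallback with the same lane-by-lane semantics. */
    for (int i = 0; i < LANES; ++i)
        po[i] = pa[i] + pb[i];
#endif

    /* Write back only the valid lanes. */
    memcpy(out, po, (size_t)n * sizeof(float));
}
```

Calling `add_partial(a, b, out, 3)` on three-element inputs leaves any fourth element of `out` untouched, which is the point of the sketch: the vector operation runs even though one slot carries no valid data.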


“…For instance, a set of single-instruction multiple-data (SIMD) registers have been employed to parallelize data operations within the processor [29]. In 1999, Intel's Pentium III processor family first introduced the streaming SIMD extensions (SSE) instructions (Intel Corp; Santa Clara, California).…”

Section: Introduction (mentioning)

confidence: 99%

“…In 1999, Intel's Pentium III processor family first introduced the streaming SIMD extensions (SSE) instructions (Intel Corp; Santa Clara, California). SSE expand the SIMD execution model by a new set of 128-bit registers to provide the ability to perform SIMD operations on packed and scalar single-precision floating-point values [29][30]. SSE2 was then introduced in 2001 along with the Pentium IV and Intel Xeon processors [30] to enable more computations in parallel.…”

Section: Introduction (mentioning)

confidence: 99%
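As a rough illustration of the packed single-precision operations mentioned in the statement above, the following C sketch adds four floats with one SSE instruction via compiler intrinsics. The function name `add4` is hypothetical, and the scalar fallback is an assumption added so the sketch also compiles on targets without SSE.

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Computes out[i] = a[i] + b[i] for i = 0..3. With SSE, the four
 * single-precision values fit in one 128-bit register and are added
 * by a single packed instruction. */
void add4(const float *a, const float *b, float *out)
{
    assert(a && b && out);

#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);            /* load 4 floats, unaligned */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* packed add, then store */
#else
    /* Scalar fallback for non-SSE targets. */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

A caller passes three arrays of at least four floats each, for example `add4(a, b, out)`. The unaligned load and store intrinsics are used here so that the arrays need no special alignment.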

Feasibility of controlling prosthetic hand using sonomyography signal in real time: Preliminary study

Chang

Zheng

2010

JRRD

Abstract: The morphological changes of muscle can be accurately detected by sonography, a process we have termed sonomyography (SMG). This article investigates the feasibility of using muscle thickness deformation SMG as a new signal source to control a prosthetic hand in real time. Thickness deformation SMG of the extensor muscle was measured by a block-matching algorithm during wrist extension-flexion; the amplitude of the deformation was used to control the prosthetic hand. We compared various fast-search algorithms to select the best one for real-time prosthetic control. The two-dimensional logarithmic search (TDL) algorithm, with and without streaming single-instruction multiple-data extensions, showed excellent execution efficiency, with an overall mean correlation coefficient of about 0.99, a mean standard root-mean-square error <0.75, and a mean relative root-mean-square error <8.0% referenced to the cross-correlation algorithm baseline. The mean frame rates were greater than the ultrasound sampling rate (12 Hz), indicating that TDL could be implemented in real-time control. These results demonstrate that only one muscle position is needed to control a prosthetic hand, allowing for proprioception of muscle tension, and that the SMG provides good control of the prosthetic hand, allowing it to proportionally open and close with a fast-search algorithm.
