2005
DOI: 10.1109/jproc.2004.840491

Efficient Utilization of SIMD Extensions

Abstract: This paper targets automatic performance tuning of numerical kernels in the presence of multi-layered memory hierarchies and SIMD parallelism. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tunin…

Cited by 53 publications (34 citation statements)
References 39 publications
“…That is why a wide range of accelerator modules have been included, as specific-purpose arithmetic units and sets of instructions or co-processors. The inclusion of SIMD units is decisive for tasks such as video encoding and decoding, but any data-intensive algorithm can take advantage of them (Franchetti et al, 2005).…”
Section: Microprocessors (mentioning)
confidence: 99%
“…As a result, vectorization generally targets inner-most loops. The vectorization or simdization can be categorized into two principal approaches: the traditional loop-based parallelization [13,28,36,24] and the basic block approach [23,19,35].…”
Section: Related Work (mentioning)
confidence: 99%
“…In previous work, we developed a formal vectorization approach [6] and applied it successfully across a wide range of short vector SIMD platforms for vector lengths of two and four both to Fftw [4,5] and Spiral [7,8,9]. We showed that neither original vector computer FFT algorithms [17,25] nor vectorizing compilers [13,18] are capable of producing high-performance FFT implementations for short vector SIMD architectures, even in tandem with automatic performance tuning [9].…”
Section: Formal Vectorization (mentioning)
confidence: 99%
“…A detailed description of our formal vectorization method and its application to a wide range of short vector SIMD architectures can be found in [6,7,9]. …”
Section: Algorithm 1 (Short Vector Cooley-Tukey FFT) (mentioning)
confidence: 99%