Efficient Utilization of SIMD Extensions

Franchetti, Franz; Král, Štefan; Lorenz, Juergen; Ueberhuber, Christoph W.

doi:10.1109/jproc.2004.840491

Cited by 53 publications

(34 citation statements)

References 39 publications

“…That is why a wide range of accelerator modules have been included, as specific-purpose arithmetic units and sets of instructions or co-processors. The inclusion of SIMD units is decisive for tasks such as video encoding and decoding, but any data-intensive algorithm can take advantage of them (Franchetti et al, 2005).…”

Section: Microprocessors (mentioning)

confidence: 99%

Towards the Optimal Hardware Architecture for Computer Vision

Nieto, Vilariño, Sánchez

2012

Machine Vision - Applications and Systems

“…As a result, vectorization generally targets inner-most loops. The vectorization or simdization can be categorized into two principal approaches: the traditional loop-based parallelization [13,28,36,24] and the basic block approach [23,19,35].…”

Section: Related Work (mentioning)

confidence: 99%

Automatic parallelization for graphics processing units

Leung, Lhoták, Lashari

2009

Proceedings of the 7th International Conference on Principles and Practice of Programming in Java

Accelerated graphics cards, or Graphics Processing Units (GPUs), have become ubiquitous in recent years. On the right kinds of problems, GPUs greatly surpass CPUs in terms of raw performance. However, because they are difficult to program, GPUs are used only for a narrow class of special-purpose applications; the raw processing power made available by GPUs is unused most of the time. This paper presents an extension to a Java JIT compiler that executes suitable code on the GPU instead of the CPU. Both static and dynamic features are used to decide whether it is feasible and beneficial to off-load a piece of code on the GPU. The paper presents a cost model that balances the speedup available from the GPU against the cost of transferring input and output data between main memory and GPU memory. The cost model is parameterized so that it can be applied to different hardware combinations. The paper also presents ways to overcome several obstacles to parallelization inherent in the design of the Java bytecode language: unstructured control flow, the lack of multi-dimensional arrays, the precise exception semantics, and the proliferation of indirect references.

“…In previous work, we developed a formal vectorization approach [6] and applied it successfully across a wide range of short vector SIMD platforms for vector lengths of two and four both to Fftw [4,5] and Spiral [7,8,9]. We showed that neither original vector computer FFT algorithms [17,25] nor vectorizing compilers [13,18] are capable of producing high-performance FFT implementations for short vector SIMD architectures, even in tandem with automatic performance tuning [9].…”

Section: Formal Vectorization (mentioning)

confidence: 99%

“…A detailed description of our formal vectorization method and its application to a wide range of short vector SIMD architectures can be found in [6,7,9].…”

Section: Algorithm 1 (Short Vector Cooley-Tukey FFT) (mentioning)

confidence: 99%

Automatically Tuned FFTs for BlueGene/L’s Double FPU

Franchetti, Král, Lorenz, et al.

2005

High Performance Computing for Computational Science - VECPAR 2004

Abstract. IBM is currently developing the new line of BlueGene/L supercomputers. The top-of-the-line installation is planned to be a 65,536-processor system featuring a peak performance of 360 Tflop/s. This system is expected to lead the Top 500 list when installed in 2005 at the Lawrence Livermore National Laboratory. This paper presents one of the first numerical kernels run on a prototype BlueGene/L machine. We tuned our formal vectorization approach as well as the Vienna MAP vectorizer to support BlueGene/L's custom two-way short vector SIMD "double" floating-point unit and connected the resulting methods to the automatic performance tuning systems Spiral and Fftw. Our approach produces automatically tuned high-performance FFT kernels for BlueGene/L that are up to 45% faster than the best scalar Spiral-generated code and up to 75% faster than Fftw when run on a single BlueGene/L processor.
