Auto-vectorization of interleaved data for SIMD

Nuzman, Dorit; Rosen, Ira; Zaks, Ayal

doi:10.1145/1133255.1133997

Cited by 75 publications

(71 citation statements)

References 23 publications


“…Using the simple canonical approach from Section 2.2, we might generate loads and stores of the same data more than once. Similar to Nuzman et al [2006], we exploit this spatial locality by allowing multiple accesses to share mapped register sets when interleaving/deinterleaving, reducing the number of memory operations in the vectorized loop.…”

Section: Exploiting Spatial Locality: Grouping Multiple Interleaved A (mentioning)

confidence: 99%

“…Investigations of bottlenecks in SIMD programs have identified non-unit-stride memory access patterns as a particular concern [Talla et al 2003;Maleki et al 2011;Schaub et al 2015]. Nuzman et al [2006] proposed an auto-vectorization algorithm for interleaved data access patterns where the stride is a power-of-two. Given a loop with such an access pattern, the algorithm generates extremely efficient vectorized code by directly exploiting the structure of the access pattern.…”

Section: Introduction (mentioning)

confidence: 99%


Automatic Vectorization of Interleaved Data Revisited

Anderson; Malik; Gregg

2015. ACM Trans. Archit. Code Optim.

Automatically exploiting short vector instructions sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires additional work to gather and scatter the data. There are several varieties of non-contiguous access, including interleaved data access. An existing approach used by GCC generates extremely efficient code for loops with power-of-two interleaving factors (strides). In this paper we propose a generalization of this approach that produces similar code for any compile-time constant interleaving factor. In addition, we propose several novel program transformations which were made possible by our generalized representation of the problem. Experiments show that our approach achieves significant speedups for both power-of-two and non-power-of-two interleaving factors. Our vectorization approach results in mean speedups over scalar code of 1.77x on Intel SSE and 2.53x on Intel AVX2 in real-world benchmarking on a selection of BLAS Level 1 routines. On the same benchmark programs, GCC 5.0 achieves mean improvements of 1.43x on Intel SSE and 1.30x on Intel AVX2. In synthetic benchmarking on Intel SSE, our maximum improvement on data movement is over 4x for gathering operations and over 6x for scattering operations versus scalar code.
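As a rough illustration of the interleaved access pattern this abstract describes, the C sketch below (not taken from either paper) processes stride-2 data, such as alternating real and imaginary parts, in two ways: a plain scalar loop, and a chunked form that first loads contiguous blocks and then separates even and odd elements. The names `scale_scalar` and `scale_chunked` and the width `VF = 4` are hypothetical. Plain arrays stand in for vector registers, and the even/odd selection loop stands in for the permute instructions a vectorizer might emit.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 }; /* assumed vector width in elements (hypothetical, must be even) */

/* Scalar reference: each iteration reads two interleaved elements (stride 2). */
void scale_scalar(const float *in, float *out_re, float *out_im,
                  size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        out_re[i] = k * in[2 * i];     /* even-indexed elements */
        out_im[i] = k * in[2 * i + 1]; /* odd-indexed elements  */
    }
}

/* Chunked form: two contiguous "vector" loads, then a deinterleave step. */
void scale_chunked(const float *in, float *out_re, float *out_im,
                   size_t n, float k) {
    assert(n % VF == 0); /* the sketch omits a remainder loop */
    for (size_t i = 0; i < n; i += VF) {
        float lo[VF], hi[VF], even[VF], odd[VF];

        /* Two contiguous loads covering 2*VF interleaved elements. */
        for (size_t j = 0; j < VF; j++) {
            lo[j] = in[2 * i + j];
            hi[j] = in[2 * i + VF + j];
        }

        /* Deinterleave: gather even positions and odd positions. */
        for (size_t j = 0; j < VF / 2; j++) {
            even[j]          = lo[2 * j];
            even[VF / 2 + j] = hi[2 * j];
            odd[j]           = lo[2 * j + 1];
            odd[VF / 2 + j]  = hi[2 * j + 1];
        }

        /* The arithmetic now runs on contiguous data. */
        for (size_t j = 0; j < VF; j++) {
            out_re[i + j] = k * even[j];
            out_im[i + j] = k * odd[j];
        }
    }
}
```

Both functions produce the same outputs; the chunked form only reorders the memory accesses so that every load is contiguous, which is the property short-vector instructions favor.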


“…There has been significant recent work in generating effective code for SIMD vector instruction sets in the presence of hardware alignment and stride constraints as described in [12,44,45,31,13]. The difficulties of optimizing for a wide range of SIMD vector architectures are discussed in [29,14].…”

Section: Related Work (mentioning)

confidence: 99%

“…The difficulties of optimizing for a wide range of SIMD vector architectures are discussed in [29,14]. In addition, several other works have addressed the exploitation of SIMD instruction sets [22,24,23,30,32,31,28]. All of these works only address SIMD hardware alignment issues.…”

Section: Related Work (mentioning)

confidence: 99%

Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures

Henretty; Stock; Pouchet; et al.

2011. Lecture Notes in Computer Science

Abstract. Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on several modern processors with SIMD capabilities.


“…Various impactful techniques have been applied to automatically generate SIMD code and to address the difficulties during vectorizing such as data permutations [8], interleaved data [9], etc. However, the optimizing approaches employed by those compilers still cannot drastically eliminate the irregular and non-aligned obstacles.…”

Section: Introduction (mentioning)

confidence: 99%

An SIMD Code Generation Technology for Indirect Array

Li; Zhao; Zhang; et al.

2016. IJCTE

Abstract. Due to disjoint memory references and non-aligned memory references, existing SIMD compilers can't vectorize loops containing indirect array utilizing SIMD (single instruction multiple data) instructions. However, addressing this problem is inevitable, since many important applications extensively use this program pattern to reduce memory and computation requirement. In this paper, we propose a new efficient code generation technique for indirect array. For an irregular indirect array access, we adopt two separate registers to store the array base and the index address. It significantly contributes to the performance improvement by vectorizing more loops and reducing the overheads. We also developed this method in our auto-vectorization compiler SW-VEC. The experimental results show that the proposed method can translate applications with indirect array access into high-performance targeted vectorized codes, thereby advancing the execution efficiency adequately.

