Abstract: Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data reorganization manipulations. Computations on non-contiguous and especially interleaved data appear in important applications, which can greatly benefit from SIMD instructions once the data is reorganized properly. Vectorizing such computations efficiently is therefore an am…
“…Using the simple canonical approach from Section 2.2, we might generate loads and stores of the same data more than once. Similar to Nuzman et al [2006], we exploit this spatial locality by allowing multiple accesses to share mapped register sets when interleaving/deinterleaving, reducing the number of memory operations in the vectorized loop.…”
“…Investigations of bottlenecks in SIMD programs have identified non-unit-stride memory access patterns as a particular concern [Talla et al 2003;Maleki et al 2011;Schaub et al 2015]. Nuzman et al [2006] proposed an auto-vectorization algorithm for interleaved data access patterns where the stride is a power-of-two. Given a loop with such an access pattern, the algorithm generates extremely efficient vectorized code by directly exploiting the structure of the access pattern.…”
Automatically exploiting short vector instruction sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires additional work to gather and scatter the data. There are several varieties of non-contiguous access, including interleaved data access. An existing approach used by GCC generates extremely efficient code for loops with power-of-two interleaving factors (strides). In this paper we propose a generalization of this approach that produces similar code for any compile-time constant interleaving factor. In addition, we propose several novel program transformations which were made possible by our generalized representation of the problem. Experiments show that our approach achieves significant speedups for both power-of-two and non-power-of-two interleaving factors. Our vectorization approach results in mean speedups over scalar code of 1.77x on Intel SSE and 2.53x on Intel AVX2 in real-world benchmarking on a selection of BLAS Level 1 routines. On the same benchmark programs, GCC 5.0 achieves mean improvements of 1.43x on Intel SSE and 1.30x on Intel AVX2. In synthetic benchmarking on Intel SSE, our maximum improvement on data movement is over 4x for gathering operations and over 6x for scattering operations versus scalar code.
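For readers unfamiliar with the term, the kind of loop these abstracts call an interleaved access can be sketched in scalar C as follows. This is only an illustration and is not taken from any of the cited papers: the function name `deinterleave3` and the (x, y, z) layout are hypothetical, and the interleaving factor of 3 is chosen as an example of a non-power-of-two stride.

```c
#include <stddef.h>

/* Hypothetical helper (not from the cited papers): split an array of
   interleaved (x, y, z) triples into three contiguous arrays.
   Each output stream reads the input with a stride of 3 elements,
   i.e. an interleaving factor of 3, which is not a power of two. */
void deinterleave3(const float *in, size_t n,
                   float *x, float *y, float *z)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = in[3 * i + 0]; /* elements 0, 3, 6, ... */
        y[i] = in[3 * i + 1]; /* elements 1, 4, 7, ... */
        z[i] = in[3 * i + 2]; /* elements 2, 5, 8, ... */
    }
}
```

A vectorizer handling such a loop has to load contiguous vectors from `in` and then rearrange their lanes into the three output streams, which is the data reorganization the abstracts refer to.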
“…There has been significant recent work in generating effective code for SIMD vector instruction sets in the presence of hardware alignment and stride constraints as described in [12,44,45,31,13]. The difficulties of optimizing for a wide range of SIMD vector architectures are discussed in [29,14].…”
“…The difficulties of optimizing for a wide range of SIMD vector architectures are discussed in [29,14]. In addition, several other works have addressed the exploitation of SIMD instruction sets [22,24,23,30,32,31,28]. All of these works only address SIMD hardware alignment issues.…”
Abstract. Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on several modern processors with SIMD capabilities.
“…Various impactful techniques have been applied to automatically generate SIMD code and to address the difficulties during vectorizing such as data permutations [8], interleaved data [9], etc. However, the optimizing approaches employed by those compilers still cannot drastically eliminate the irregular and non-aligned obstacles.…”
Abstract: Because of disjoint and non-aligned memory references, existing SIMD compilers cannot use SIMD (single instruction multiple data) instructions to vectorize loops that contain indirect array accesses. However, addressing this problem is unavoidable, since many important applications use this program pattern extensively to reduce memory and computation requirements. In this paper, we propose a new efficient code generation technique for indirect array accesses. For an irregular indirect array access, we use two separate registers to store the array base and the index address. This contributes significantly to performance by vectorizing more loops and reducing overheads. We also implemented this method in our auto-vectorization compiler SW-VEC. The experimental results show that the proposed method can translate applications with indirect array accesses into high-performance vectorized code for the target, thereby improving execution efficiency.