Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2010
DOI: 10.1145/1693453.1693510

Data transformations enabling loop vectorization on multithreaded data parallel architectures

Abstract: Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization…
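The abstract's premise, that irregular strides block vectorization and a data layout change restores it, can be illustrated with the classic array-of-structures versus structure-of-arrays contrast on a GPU. The sketch below is illustrative only; the `Particle` type and kernel names are invented, not taken from the paper:

```cuda
// Sketch: array-of-structures (AoS) vs structure-of-arrays (SoA) layout.
// All names here (Particle, scale_aos, scale_soa) are illustrative.
struct Particle { float x, y, z; };

__global__ void scale_aos(Particle *p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // strided: thread i touches offset 3*i floats
        p[i].x *= s; p[i].y *= s; p[i].z *= s;
    }
}

__global__ void scale_soa(float *x, float *y, float *z, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // unit stride: thread i reads x[i]
        x[i] *= s; y[i] *= s; z[i] *= s;
    }
}
```

In the AoS kernel a warp's 32 loads of `.x` are 12 bytes apart and span several memory transactions; in the SoA kernel they are consecutive and coalesce into a few DRAM bursts, which is the vectorizable access pattern the paper's transformations aim to produce.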

Cited by 17 publications (9 citation statements); references 3 publications.

Citation statements (ordered by relevance):
“…Rivera and Tseng [35] presented data padding techniques to avoid conflict misses. Recently, linear data layout transformations to improve vector performance have been proposed [15].…”
Section: Related Work
confidence: 99%
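Rivera and Tseng's padding targets cache conflict misses on CPUs; the same idea carries over to GPU shared memory, where a power-of-two row pitch puts a column's elements in the same bank. A minimal sketch of the padded-layout idea in that setting, assuming a 32-bank device; this is my adaptation, not code from either paper:

```cuda
// A 32x32 shared-memory tile places every element of a column in the same
// bank, so a column-wise read by a warp serializes. Padding each row by one
// element (TILE + 1) shifts successive rows to different banks.
#define TILE 32

__global__ void transpose(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    // Transposed block coordinates for the coalesced write-back.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y]; // conflict-free thanks to padding
}
```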
“…For GPUs, we know of no previous work applying data layout transformation to structured-grid codes other than for gaining unit-strided accesses [11,27], which helps vectorizing memory accesses into DRAM bursts (i.e. coalescing).…”
Section: Common Access Patterns of PDE Solvers on Structured Grids
confidence: 99%
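The unit-stride point for structured grids comes down to which grid dimension is fastest-varying in memory. A short sketch, with invented macro and kernel names rather than code from [11] or [27]:

```cuda
// With threadIdx.x mapped to the dimension that is fastest-varying in memory
// (IDX_XY), a warp's loads are consecutive and coalesce into DRAM bursts.
// Under the permuted layout (IDX_YX) the same warp's loads are ny words
// apart and each becomes its own transaction.
#define IDX_XY(x, y, nx) ((size_t)(y) * (nx) + (x))  // x fastest-varying
#define IDX_YX(x, y, ny) ((size_t)(x) * (ny) + (y))  // y fastest-varying

__global__ void smooth(float *out, const float *in, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // warp varies along x
    int y = blockIdx.y;                              // one grid row per block row
    if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1)
        out[IDX_XY(x, y, nx)] = 0.25f * (in[IDX_XY(x - 1, y, nx)] +
                                         in[IDX_XY(x + 1, y, nx)] +
                                         in[IDX_XY(x, y - 1, nx)] +
                                         in[IDX_XY(x, y + 1, nx)]);
}
```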
“…Intuitively, this can be addressed by loop transformations to achieve unit-strided access in the inner loop. However, for arrays of structures, it is necessary to employ data layout transformations, such as dimension permutation, to achieve vectorization [11] or reduce coherence overhead [12].…”
Section: Introduction
confidence: 99%
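The loop-transformation half of that statement can be sketched on the host side; the array size `N` and function names below are illustrative:

```cuda
// Interchanging the loops turns a stride-N inner loop into a unit-stride one
// that a compiler can vectorize (host-side C++, purely illustrative).
enum { N = 1024 };

void scale_strided(float (*a)[N], int rows, float s) {
    for (int j = 0; j < N; ++j)          // inner loop walks a column:
        for (int i = 0; i < rows; ++i)   //   stride N in a row-major array
            a[i][j] *= s;
}

void scale_unit(float (*a)[N], int rows, float s) {
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < N; ++j)      // unit stride: vectorizable
            a[i][j] *= s;
}
```

For an array of structures, however, no interchange yields unit stride, because the stride is baked into the element layout itself; that is where a dimension permutation such as the AoS-to-SoA rewrite sketched after the abstract becomes necessary.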
“…This series of kernels is invoked inside a loop iteration, with each loop iteration processing a subset of the input data set that fits nicely in GPU memory. We have shown in prior work that it is critical to perform a proper mapping of the data set to the GPU memory subsystem to obtain high performance [9].…”
Section: Figure 3: The LOF Algorithm
confidence: 99%
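The chunked-processing pattern that statement describes can be sketched as a host loop that stages one GPU-memory-sized piece at a time; `process_chunk` and `CHUNK` below are invented stand-ins for the kernel series and chunk size in the cited work:

```cuda
#include <cuda_runtime.h>
#include <algorithm>

__global__ void process_chunk(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];        // stand-in for the real kernel series
}

void run(const float *h_in, float *h_out, size_t total) {
    const size_t CHUNK = 1 << 24;         // elements per iteration; fits in GPU memory
    float *d;
    cudaMalloc(&d, CHUNK * sizeof(float));
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = std::min(CHUNK, total - off);
        cudaMemcpy(d, h_in + off, n * sizeof(float), cudaMemcpyHostToDevice);
        process_chunk<<<(unsigned)((n + 255) / 256), 256>>>(d, (int)n);
        cudaMemcpy(h_out + off, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d);
}
```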