Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2010
DOI: 10.1145/1693453.1693510

Data transformations enabling loop vectorization on multithreaded data parallel architectures

Abstract: Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization…
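The abstract's premise, that irregular strides block vectorization and a data layout change restores it, can be illustrated with the classic array-of-structures versus structure-of-arrays contrast on a GPU. The sketch below is illustrative only; the `Particle` type and kernel names are invented, not taken from the paper:

```cuda
// Sketch: array-of-structures (AoS) vs structure-of-arrays (SoA) layout.
// All names here (Particle, scale_aos, scale_soa) are illustrative.
struct Particle { float x, y, z; };

__global__ void scale_aos(Particle *p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // strided: thread i touches offset 3*i floats
        p[i].x *= s; p[i].y *= s; p[i].z *= s;
    }
}

__global__ void scale_soa(float *x, float *y, float *z, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // unit stride: thread i reads x[i]
        x[i] *= s; y[i] *= s; z[i] *= s;
    }
}
```

In the AoS kernel a warp's 32 loads of `.x` are 12 bytes apart and span several memory transactions; in the SoA kernel they are consecutive and coalesce into a few DRAM bursts, which is the vectorizable access pattern the paper's transformations aim to produce.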

Cited by 17 publications (9 citation statements); references 3 publications.

Citation statements (ordered by relevance):
“…Rivera and Tseng [35] presented data padding techniques to avoid conflict misses. Recently, linear data layout transformations to improve vector performance have been proposed [15].…”
Section: Related Work
confidence: 99%
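Rivera and Tseng's padding targets cache conflict misses on CPUs; the same idea carries over to GPU shared memory, where a power-of-two row pitch puts a column's elements in the same bank. A minimal sketch of the padded-layout idea in that setting, assuming a 32-bank device; this is my adaptation, not code from either paper:

```cuda
// A 32x32 shared-memory tile places every element of a column in the same
// bank, so a column-wise read by a warp serializes. Padding each row by one
// element (TILE + 1) shifts successive rows to different banks.
#define TILE 32

__global__ void transpose(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    // Transposed block coordinates for the coalesced write-back.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y]; // conflict-free thanks to padding
}
```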
“…For GPUs, we know of no previous work applying data layout transformation to structured-grid codes other than for gaining unit-strided accesses [11,27], which helps vectorizing memory accesses into DRAM bursts (i.e. coalescing).…”
Section: Common Access Patterns of PDE Solvers on Structured Grids
confidence: 99%
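The unit-stride point for structured grids comes down to which grid dimension is fastest-varying in memory. A short sketch, with invented macro and kernel names rather than code from [11] or [27]:

```cuda
// With threadIdx.x mapped to the dimension that is fastest-varying in memory
// (IDX_XY), a warp's loads are consecutive and coalesce into DRAM bursts.
// Under the permuted layout (IDX_YX) the same warp's loads are ny words
// apart and each becomes its own transaction.
#define IDX_XY(x, y, nx) ((size_t)(y) * (nx) + (x))  // x fastest-varying
#define IDX_YX(x, y, ny) ((size_t)(x) * (ny) + (y))  // y fastest-varying

__global__ void smooth(float *out, const float *in, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // warp varies along x
    int y = blockIdx.y;                              // one grid row per block row
    if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1)
        out[IDX_XY(x, y, nx)] = 0.25f * (in[IDX_XY(x - 1, y, nx)] +
                                         in[IDX_XY(x + 1, y, nx)] +
                                         in[IDX_XY(x, y - 1, nx)] +
                                         in[IDX_XY(x, y + 1, nx)]);
}
```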
“…Intuitively, this can be addressed by loop transformations to achieve unit-strided access in the inner loop. However, for arrays of structures, it is necessary to employ data layout transformations, such as dimension permutation, to achieve vectorization [11] or reduce coherence overhead [12].…”
Section: Introduction
confidence: 99%
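The loop-transformation half of that statement can be sketched on the host side; the array size `N` and function names below are illustrative:

```cuda
// Interchanging the loops turns a stride-N inner loop into a unit-stride one
// that a compiler can vectorize (host-side C++, purely illustrative).
enum { N = 1024 };

void scale_strided(float (*a)[N], int rows, float s) {
    for (int j = 0; j < N; ++j)          // inner loop walks a column:
        for (int i = 0; i < rows; ++i)   //   stride N in a row-major array
            a[i][j] *= s;
}

void scale_unit(float (*a)[N], int rows, float s) {
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < N; ++j)      // unit stride: vectorizable
            a[i][j] *= s;
}
```

For an array of structures, however, no interchange yields unit stride, because the stride is baked into the element layout itself; that is where a dimension permutation such as the AoS-to-SoA rewrite sketched after the abstract becomes necessary.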
“…This series of kernels is invoked inside a loop iteration, with each loop iteration processing a subset of the input data set that fits nicely in GPU memory. We have shown in prior work that it is critical to perform a proper mapping of the data set to the GPU memory subsystem to obtain high performance [9].…”
Section: Figure 3: The LOF Algorithm
confidence: 99%
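The chunked-processing pattern that statement describes can be sketched as a host loop that stages one GPU-memory-sized piece at a time; `process_chunk` and `CHUNK` below are invented stand-ins for the kernel series and chunk size in the cited work:

```cuda
#include <cuda_runtime.h>
#include <algorithm>

__global__ void process_chunk(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];        // stand-in for the real kernel series
}

void run(const float *h_in, float *h_out, size_t total) {
    const size_t CHUNK = 1 << 24;         // elements per iteration; fits in GPU memory
    float *d;
    cudaMalloc(&d, CHUNK * sizeof(float));
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = std::min(CHUNK, total - off);
        cudaMemcpy(d, h_in + off, n * sizeof(float), cudaMemcpyHostToDevice);
        process_chunk<<<(unsigned)((n + 255) / 256), 256>>>(d, (int)n);
        cudaMemcpy(h_out + off, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d);
}
```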