2011
DOI: 10.1109/tpds.2010.107
Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Cited by 159 publications (83 citation statements)
References 15 publications
“…This type of memory is advantageous when accessing large and contiguous regions. As long as the memory access pattern is optimized, it can effectively handle hundreds or thousands of simultaneous data read or write transactions [26]. On the other hand, frequent lightweight memory accesses to random regions of the global memory lead to bottlenecks in data transfer due to the high latency, resulting in a serious decrease in the performance of the parallelized application.…”
Section: A. GPU Processors
confidence: 99%
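The coalescing behaviour this excerpt describes can be illustrated with a minimal CUDA sketch (the kernel names and stride parameter below are illustrative assumptions, not code from the cited paper): when consecutive threads of a warp touch consecutive words, the warp's request collapses into a few wide transactions, while scattered or strided indices force many narrow, high-latency transactions.

#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so a warp's 32 loads fall in a
// small number of contiguous memory segments.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read words far apart, so the same warp
// issues many separate transactions and global-memory latency dominates.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)i * stride % n;   // scatter the accesses across memory
    if (i < n) out[i] = in[j];
}

With a large stride the second kernel moves the same amount of data but typically achieves only a fraction of the effective bandwidth of the first, which is the performance gap the citing paper refers to.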
“…The DL [19] work studies an Array-of-Structure-of-Tiled-Array (ASTA) layout and in-place data marshaling for improving the device memory throughput of GPUs. Jang et al. [10] used a mathematical model and algorithms to analyze data access patterns and target loop vectorization and GPU memory selection for different patterns. Zhang et al. [23] proposed a library to reduce irregularities in GPU programs through a level of indirection and job swapping to improve branch and memory divergence.…”
Section: Related Work
confidence: 99%
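The layout problem that ASTA-style marshaling addresses can be seen in the familiar array-of-structures versus structure-of-arrays contrast; the particle example below is an assumed illustration in CUDA, not code from the cited works, and ASTA itself is a tiled hybrid of the two forms that supports in-place conversion.

// Array-of-Structures: threads reading one field (e.g. x) of consecutive
// elements access memory with a stride of sizeof(Particle) -> poor coalescing.
struct ParticleAoS { float x, y, z, w; };

__global__ void scale_x_aos(ParticleAoS *p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;      // every warp load skips 3 floats
}

// Structure-of-Arrays: each field lives in its own contiguous array,
// so a warp's accesses to the same field are unit-stride and coalesce.
__global__ void scale_x_soa(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;        // unit-stride accesses per warp
}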
“…High application performance relies heavily on efficient memory bandwidth utilization. Though GPUs usually have a wider memory interface than CPUs, performance would be suboptimal in the presence of insufficient memory coalescing [10], [20], [23].…”
Section: Introduction
confidence: 99%
“…The CUDA grid and block indexing is column-major ordered, with the 'x' direction along the columns and the 'y' direction along the rows, and the threads are scheduled in that order. Thus, when copying arrays from global memory to shared memory, it is important that the access pattern of the memory being copied matches the access pattern of the thread scheduler [34]. Since C arrays are row-major, a shared memory array A should be allocated (counter-…). Since in MATLAB (when using meshgrid() to form the x and y arrays of Ψ) the x direction of Ψ is along the rows of the Ψ array and the y direction is along the columns, and this is transposed in the MEX file, then, as mentioned in Sec.…”
Section: Two-Dimensional Specific Code Design
confidence: 99%
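The pattern this excerpt describes, matching the fastest-varying thread index (threadIdx.x) to the contiguous dimension of a row-major array when staging data in shared memory, looks roughly like the CUDA sketch below. The tile size, padding trick, and kernel name are assumptions for illustration, not the cited code.

#define TILE 32

// Stage a tile of a row-major matrix in shared memory. threadIdx.x is the
// fastest-varying thread index, so it must sweep the contiguous (column)
// index of the row-major array for the global-memory accesses to coalesce.
__global__ void tile_copy(const float *in, float *out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];   // +1 pad avoids shared-memory bank conflicts

    int col = blockIdx.x * TILE + threadIdx.x;   // contiguous in memory
    int row = blockIdx.y * TILE + threadIdx.y;

    if (row < rows && col < cols)
        tile[threadIdx.y][threadIdx.x] = in[row * cols + col];   // coalesced read
    __syncthreads();

    if (row < rows && col < cols)
        out[row * cols + col] = tile[threadIdx.y][threadIdx.x];  // coalesced write
}

Launching with dim3 block(TILE, TILE) keeps threadIdx.x sweeping consecutive addresses; for a column-major (MATLAB-ordered) array the roles of row and col would have to be swapped, which is the transposition issue the excerpt raises.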