2011 International Conference on High Performance Computing & Simulation
DOI: 10.1109/hpcsim.2011.5999886

Understanding the impact of CUDA tuning techniques for Fermi

Cited by 36 publications (15 citation statements)
References 4 publications
“…In the Fermi architecture, when L1 cache memory is active, the size of a global memory transaction is 128 bytes. If L1 cache memory is not active, it is 32 bytes. With noncoalesced memory access, part of the data brought from memory is unused, so the bandwidth is not being effectively used.…”
Section: Methods
mentioning confidence: 99%
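As a rough illustration of the coalescing behaviour this excerpt describes, the sketch below contrasts a unit-stride (coalesced) copy with a copy strided by 128 bytes, where a warp's loads scatter across many cache lines so that most of each fetched line is wasted. The kernel names, problem size, and stride are illustrative assumptions, not taken from the cited papers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: consecutive threads of a warp read consecutive 4-byte words,
// so the warp's 32 loads are served by a single 128-byte transaction
// when L1 caching of global loads is enabled.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Noncoalesced: a stride of 32 floats (128 bytes) sends every thread of a
// warp to a different cache line, so most of each fetched line goes unused.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((size_t)i * stride) % n];
}

int main()
{
    const int n = 1 << 24;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    copy_coalesced<<<grid, block>>>(in, out, n);      // near-peak bandwidth
    copy_strided  <<<grid, block>>>(in, out, n, 32);  // wastes most of each line
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

On Fermi, L1 caching of global loads can also be turned off at compile time with nvcc -Xptxas -dlcm=cg, in which case misses are served by 32-byte segments rather than 128-byte lines, which reduces the waste for scattered access patterns.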
“…If L1 cache memory is not active, it is 32 bytes [33]. With noncoalesced memory access, part of the data brought from memory is unused, so the bandwidth is not being effectively used. If this bandwidth waste is mitigated, some improvement in performance can be expected.…”
Section: Advanced Strategy 1: Allocation Of 1 Individual Per Thread O…
mentioning confidence: 99%
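One common way to mitigate that waste, in the spirit of the one-individual-per-thread allocation named in the section title above, is to store the population as a structure of arrays rather than an array of structures, so that a warp's accesses to a given field fall on consecutive addresses and coalesce. The sketch below is only an assumed layout for illustration; the struct names, field count, and kernels are not from the cited paper.

```cuda
#include <cuda_runtime.h>

// Array-of-structures: each individual occupies a contiguous 32-byte record,
// so when every thread reads the same field the warp's loads are strided
// and most of each fetched cache line is wasted.
struct IndividualAoS { float fitness; float genes[7]; };

__global__ void read_fitness_aos(const IndividualAoS *pop, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = pop[i].fitness;     // 32-byte stride per thread
}

// Structure-of-arrays: the same field of consecutive individuals is
// contiguous, so the warp's 32 loads coalesce into full transactions.
struct PopulationSoA { float *fitness; float *genes; /* genes: 7*n floats */ };

__global__ void read_fitness_soa(PopulationSoA pop, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = pop.fitness[i];     // unit stride, coalesced
}
```

Allocating pop.fitness and pop.genes as two separate cudaMalloc'd arrays (or one array with a fixed offset) is enough to obtain the coalesced layout; the per-thread logic itself does not change.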
“…The factors affecting occupancy are the thread block size, the shared memory used by each thread block, and the registers used by each thread. Tuning the thread block size can have a significant effect on performance [24]. Therefore, we aim to keep the occupancy of the new kernel as high as possible by tuning the thread block size.…”
Section: Tuning Thread Block Size
mentioning confidence: 99%
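The occupancy trade-off described here can be explored programmatically. The sketch below uses the CUDA occupancy API (added in CUDA 6.5, i.e. after the Fermi-era paper under discussion, which would have relied on the occupancy calculator spreadsheet); my_kernel is a stand-in, not a kernel from the cited work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Theoretical occupancy for a range of candidate block sizes: the limit
    // comes from registers per thread, shared memory per block, and the
    // per-SM thread/block caps, exactly the factors listed in the excerpt.
    for (int blockSize = 64; blockSize <= 1024; blockSize *= 2) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, my_kernel, blockSize, /*dynamicSmemBytes=*/0);
        float occupancy = (float)(blocksPerSM * blockSize) /
                          prop.maxThreadsPerMultiProcessor;
        printf("block %4d -> %d resident blocks/SM, occupancy %.2f\n",
               blockSize, blocksPerSM, occupancy);
    }

    // Or let the runtime suggest a block size that maximizes occupancy.
    int minGridSize = 0, bestBlockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &bestBlockSize, my_kernel);
    printf("occupancy-maximizing block size: %d\n", bestBlockSize);
    return 0;
}
```

As the quoted passage implies, maximizing theoretical occupancy is a useful target rather than a guarantee of best performance, so candidate block sizes are typically still timed empirically.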
“…In addition, these studies usually change the programs themselves, while our work attempts to analyze memory behaviors of given programs. One study [17] provides a limited observation of GPU cache impact on a handful of simple kernels. In contrast, our work provides a more systematic characterization of GPU cache effectiveness and uses that to develop an algorithm for automating the choice of how and when to use demand-fetched caches.…”
Section: Related Work
mentioning confidence: 99%