Implementing the Himeno benchmark with CUDA on GPU clusters

Phillips, Everett; Fatica, Massimiliano

doi:10.1109/ipdps.2010.5470394

Cited by 74 publications

(57 citation statements)

References 1 publication

“…Previous implementations of stencil computations on GPUs can be grouped into three categories: (1) Hand-coded implementations of a particular stencil that strive to achieve the best performance possible [17,18,20], but with optimization techniques that may not generalize to other types of stencils; (2) Implementations where ease of programming is the primary goal rather than performance, often with code generators for various stencils [5,22,14,11]; and (3) implementations that focus on a particular parameter and study how tuning it can affect performance [13,16].…”

Section: Related Work (mentioning)

confidence: 99%

“…• 19-Point Stencil (Figure 2(c)): This is also called the Himeno benchmark, the behavior of which is detailed elsewhere [20]. We use the same specification (Table I in [20]), except for ignoring the last line of residual calculation.…”

Section: Design Overview (mentioning)

confidence: 99%

“…We use the same specification (Table I in [20]), except for ignoring the last line of residual calculation. All the weights in this benchmark are array parameters, making it a very cache-unfriendly benchmark.…”

Section: Design Overview (mentioning)

confidence: 99%
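
As a rough illustration of the 19-point stencil described in the statements above, the following Python sketch performs one Jacobi-style sweep in which every weight is read from a per-cell array. The update formula follows the commonly published Himeno kernel, including the residual sum that the citing paper says it ignores. The function names (`make_grid`, `uniform_weights`, `stencil_sweep`), the nested-list layout, and the default relaxation factor of 0.8 are assumptions made for illustration, not details taken from the cited papers.

```python
def make_grid(nx, ny, nz, value=0.0):
    """Return an nx * ny * nz nested list filled with `value`."""
    return [[[value] * nz for _ in range(ny)] for _ in range(nx)]


def uniform_weights(n):
    """Hypothetical helper: constant weight grids for an n^3 domain."""
    a = [make_grid(n, n, n, 1.0) for _ in range(3)] + [make_grid(n, n, n, 1.0 / 6.0)]
    b = [make_grid(n, n, n, 0.0) for _ in range(3)]
    c = [make_grid(n, n, n, 1.0) for _ in range(3)]
    wrk1 = make_grid(n, n, n, 0.0)
    bnd = make_grid(n, n, n, 1.0)
    return a, b, c, wrk1, bnd


def stencil_sweep(p, a, b, c, wrk1, bnd, omega=0.8):
    """Apply one 19-point Jacobi sweep to the interior cells of p.

    Every weight (a, b, c, wrk1, bnd) is looked up per cell, which is why
    the benchmark stresses memory bandwidth rather than arithmetic.
    Returns (updated grid, sum of squared residuals).
    """
    nx, ny, nz = len(p), len(p[0]), len(p[0][0])
    out = [[row[:] for row in plane] for plane in p]  # boundary cells are copied unchanged
    gosa = 0.0
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                s0 = (
                    a[0][i][j][k] * p[i + 1][j][k]
                    + a[1][i][j][k] * p[i][j + 1][k]
                    + a[2][i][j][k] * p[i][j][k + 1]
                    + b[0][i][j][k] * (p[i + 1][j + 1][k] - p[i + 1][j - 1][k]
                                       - p[i - 1][j + 1][k] + p[i - 1][j - 1][k])
                    + b[1][i][j][k] * (p[i][j + 1][k + 1] - p[i][j - 1][k + 1]
                                       - p[i][j + 1][k - 1] + p[i][j - 1][k - 1])
                    + b[2][i][j][k] * (p[i + 1][j][k + 1] - p[i - 1][j][k + 1]
                                       - p[i + 1][j][k - 1] + p[i - 1][j][k - 1])
                    + c[0][i][j][k] * p[i - 1][j][k]
                    + c[1][i][j][k] * p[i][j - 1][k]
                    + c[2][i][j][k] * p[i][j][k - 1]
                    + wrk1[i][j][k]
                )
                ss = (s0 * a[3][i][j][k] - p[i][j][k]) * bnd[i][j][k]
                gosa += ss * ss  # residual accumulation
                out[i][j][k] = p[i][j][k] + omega * ss
    return out, gosa
```

Each interior cell reads its 6 face neighbours, 12 edge neighbours, and its own value, which gives the 19 points. A call such as `stencil_sweep(make_grid(3, 3, 3, 1.0), *uniform_weights(3))` runs the sweep on a small constant field.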

Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

Zhang

Mueller

2012

Proceedings of the Tenth International Symposium on Code Generation and Optimization

106

This paper develops and evaluates search and optimization techniques for auto-tuning 3D stencil (nearest-neighbor) computations on GPUs. Observations indicate that parameter tuning is necessary for heterogeneous GPUs to achieve optimal performance with respect to a search space. Our proposed framework takes a most concise specification of stencil behavior from the user as a single formula, auto-generates tunable code from it, systematically searches for the best configuration and generates the code with optimal parameter configurations for different GPUs. This auto-tuning approach guarantees adaptive performance for different generations of GPUs while greatly enhancing programmer productivity. Experimental results show that the delivered floating point performance is very close to previous handcrafted work and outperforms other auto-tuned stencil codes by a large margin.

“…The details of the Himeno computation can be found in [20,21]. Figure 6 shows how Himeno is expressed in HiDP.…”

Section: D Stencil Computation (mentioning)

confidence: 99%

Hidp: A hierarchical data parallel language

Mueller

Zhang

2013

Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Problem domains are commonly decomposed hierarchically to fully utilize parallel resources in modern microprocessors. Such decompositions can be provided as library routines, written by experienced experts, for general algorithmic patterns. But such APIs tend to be constrained to certain architectures or data sizes. Integrating them with application code is often an unnecessarily daunting task, especially when these routines need to be closely coupled with user code to achieve better performance. This paper contributes HiDP, a hierarchical data parallel language. The purpose of HiDP is to improve the coding productivity of integrating hierarchical data parallelism without significant loss of performance. HiDP is a source-to-source compiler that converts a very concise data parallel language into CUDA C++ source code. Internally, it performs necessary analysis to compose user code with efficient and architecture-aware code snippets. This paper discusses various aspects of HiDP systematically: the language, the compiler and the run-time system with built-in tuning capabilities. They enable HiDP users to express algorithms in less code than low-level SDKs require for native platforms. HiDP also exposes abundant computing resources of modern parallel architectures. Improved coding productivity tends to come with a sacrifice in performance. Yet, experimental results show that the generated code delivers performance very close to handcrafted native GPU code.

“…In data parallel applications, it provides a powerful and relatively low cost platform with a potential for significant amount of performance speedup over a traditional CPU approach. CUDA extends C or Fortran by allowing the programmer to define functions, called kernels, that when called are executed on the GPU by potentially thousands of parallel threads [3]. Therefore, there has been an explosion of interest and research in using this platform for high performance computing [4]-[9].…”

Section: A Nvidia Compute Unified Device Architecture (mentioning)

confidence: 99%
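
The kernel model described in the statement above can be sketched without a GPU. The Python snippet below sequentially emulates how each CUDA thread derives a global element index from its block and thread coordinates (in CUDA C, `blockIdx.x * blockDim.x + threadIdx.x`). The names `launch` and `saxpy_kernel` are hypothetical, and the nested loops only stand in for threads that real hardware would run in parallel.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a 1D kernel launch: call the kernel once per thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def saxpy_kernel(block_idx, block_dim, thread_idx, n, alpha, x, y):
    """Per-thread body: each thread updates at most one element of y."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < n:  # guard: the last block may have threads past the end of the data
        y[i] = alpha * x[i] + y[i]


def saxpy(alpha, x, y, block_dim=4):
    """Pick enough blocks to cover len(x) elements, then run the emulated launch."""
    n = len(x)
    grid_dim = (n + block_dim - 1) // block_dim  # ceiling division
    launch(saxpy_kernel, grid_dim, block_dim, n, alpha, x, y)
    return y
```

The bounds check inside the kernel matters whenever the element count is not a multiple of the block size, since the grid is rounded up to whole blocks.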

Use of CUDA for the Continuous Space Language Model

Thompson

Anderson

2012

2012 IEEE Conference on High Performance Extreme Computing

The training phase of the Continuous Space Language Model (CSLM) was implemented in the NVIDIA hardware/software architecture Compute Unified Device Architecture (CUDA). Implementation was accomplished using a combination of CUBLAS library routines and CUDA kernel calls on three different CUDA-enabled devices of varying compute capability, and a time savings over the traditional CPU approach was demonstrated.
