Autotuning CUDA compiler parameters for heterogeneous applications using the OpenTuner framework

Bruel, Pedro; Amarís, Marcos; Goldman, Alfredo

doi:10.1002/cpe.3973

Cited by 11 publications

References 37 publications

“…We employ the CLTune program that CLBlast uses for tuning XgemmDirect, and we implement the OpenTuner program for this kernel according to the work of Bruel et al, where we use the unconstrained search space; we report a penalty value in case of a configuration for which XgemmDirect's constraints are not satisfied.…”

Section: Results

mentioning

confidence: 99%

“…Due to these dependencies, OpenTuner is not capable of auto‐tuning GEMM. For this restriction, the OpenTuner community has offered workarounds, eg, re‐designing the user program so that its tuning parameters become independent or setting a penalty value for configurations where the constraints are not met. However, the first workaround usually requires a significant effort from the user, while the second may cause a poor tuning result as we demonstrate for GEMM in Section 7.…”

Section: Motivation and Related Work

mentioning

confidence: 99%
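The penalty-value workaround quoted above can be illustrated with a minimal sketch. It does not use OpenTuner itself: the toy cost model, the parameter names `tile` and `wg`, and the divisibility constraint are all assumptions chosen for illustration. A plain random search over the unconstrained space shows how many evaluations are spent on invalid configurations.

```python
import random

PENALTY = float("inf")  # assumed penalty cost reported for invalid configurations


def cost(tile, wg):
    """Hypothetical cost model: a configuration is valid only if wg divides tile."""
    if tile % wg != 0:
        return PENALTY
    return abs(tile - 64) + abs(wg - 16)


def random_search(trials, seed=0):
    """Sample the unconstrained space and count evaluations wasted on penalties."""
    rng = random.Random(seed)
    best, best_cost, wasted = None, PENALTY, 0
    for _ in range(trials):
        cfg = (rng.randint(1, 128), rng.randint(1, 128))
        c = cost(*cfg)
        if c == PENALTY:
            wasted += 1  # the search learns nothing useful from this sample
        elif c < best_cost:
            best, best_cost = cfg, c
    return best, best_cost, wasted
```

Calling `random_search(200)` returns the best valid pair found, its cost, and the number of penalized samples. In this toy space only a few percent of all pairs satisfy the constraint, which is one plausible reason a penalty-based search can end with a poor result.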

“…In contrast, CLTune is only suitable for auto-tuning programs written in OpenCL and only in terms of runtime performance. ATF provides OpenCL-specific tuning directives for auto-tuning OpenCL programs (Listing 2, lines 11-20). We argue that the usage of ATF for OpenCL is better than CLTune due to the following reasons.…”

Section: Comparison: ATF Versus CLTune

mentioning

confidence: 99%

“…configurations where the constraints are not met [12]. However, the first workaround usually requires a significant effort from the user, while the second may cause a poor tuning result as we demonstrate for GEMM in Section 7. Moreover, OpenTuner is optimized for programs whose tuning parameters have large ranges, and thus, it does not provide search techniques for small ranges, eg, exhaustive search that finds the probably best result.…”

mentioning

confidence: 93%
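For small parameter ranges, the exhaustive search mentioned in the statement above can be sketched in a few lines. This is a generic illustration, not the implementation of ATF or any other framework; the function name `exhaustive_search` and the example parameters are hypothetical.

```python
from itertools import product


def exhaustive_search(ranges, cost, is_valid=lambda cfg: True):
    """Evaluate every valid configuration and return the cheapest one."""
    best_cfg, best_cost = None, float("inf")
    for values in product(*ranges.values()):
        cfg = dict(zip(ranges.keys(), values))
        if not is_valid(cfg):
            continue  # skip configurations that violate the constraints
        c = cost(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost


# Hypothetical example: two parameters in 1..8, where wg must divide tile.
ranges = {"tile": range(1, 9), "wg": range(1, 9)}
result = exhaustive_search(
    ranges,
    cost=lambda c: (c["tile"] - 6) ** 2 + (c["wg"] - 4) ** 2,
    is_valid=lambda c: c["tile"] % c["wg"] == 0,
)
```

Because every valid configuration is visited, the result is the true optimum of the given cost function over the given ranges. The price is a number of evaluations that grows with the product of the range sizes, which is why the approach only suits small ranges.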

“…shows the OpenTuner program for auto-tuning GCC's optimization options for raytracer. The user defines one tuning parameter per option (lines 8-19 and 23-38), 322 in total, by overriding the manipulator function (line 23) of OpenTuner's class MeasurementInterface (line 21) and by using a so-called OpenTuner configuration manipulator (line 28); OpenTuner provides a straightforward Python script for extracting the GCC options used in lines 8-19. To define the cost function, the user overrides OpenTuner's run function (lines 40-61); in this function, he has to explicitly construct a GCC command with the optimization options according to the input configuration (lines 45-55).…”

mentioning

confidence: 99%
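The step of constructing a GCC command from an input configuration, as described in the statement above, might look roughly like the following. This is a standalone sketch and not the listing the citing paper refers to: the configuration layout (`opt_level`, `flags`), the three flag states, and the file names are assumptions, and the command is only built, not executed.

```python
def build_gcc_command(cfg, source="raytracer.cpp", output="./tmp.bin"):
    """Translate a configuration dictionary into a g++ argument list.

    Assumed layout: cfg["opt_level"] is an int, and cfg["flags"] maps a flag
    name to "on", "off", or "default".
    """
    cmd = ["g++", source, "-o", output, "-O{}".format(cfg["opt_level"])]
    for flag, state in cfg["flags"].items():
        if state == "on":
            cmd.append("-f" + flag)      # enable the optimization explicitly
        elif state == "off":
            cmd.append("-fno-" + flag)   # disable the optimization explicitly
        # "default": leave the decision to the chosen -O level
    return cmd


example = build_gcc_command({
    "opt_level": 2,
    "flags": {
        "unroll-loops": "on",
        "inline-functions": "off",
        "tree-vectorize": "default",
    },
})
```

In a real tuning run the cost function would compile with this command, execute the binary, and report the measured runtime; with hundreds of options, this hand-written translation is the kind of per-program effort the statement points to.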

ATF: A generic directive‐based auto‐tuning framework

Rasch

Gorlatch

2018

Concurrency and Computation

Summary We describe the Auto‐Tuning Framework (ATF) — a simple‐to‐use, generic approach and its implementation, as a framework for automatic program optimization by choosing the most suitable values of program parameters such as the number of parallel threads, tile sizes, etc. ATF combines four major advantages over the state‐of‐the‐art auto‐tuning: i) it is generic regarding the programming language, application domain, tuning objective (eg, high performance and/or low energy consumption), and search technique; ii) it can auto‐tune a broader class of applications by allowing tuning parameters to be interdependent, eg, when one parameter is divisible by another parameter; iii) it allows tuning parameters to have substantially larger ranges by implementing an optimized search space generation process; and iv) it is arguably simpler to use, eg, the ATF user prepares an application for auto‐tuning by annotating its source code with simple tuning directives. We demonstrate ATF's efficacy by comparing it to the state‐of‐the‐art auto‐tuning approaches, OpenTuner and CLTune; ATF shows better tuning results with less programmer's effort.
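The idea of interdependent tuning parameters, such as one parameter having to be divisible by another, can be made concrete with a naive sketch that generates only valid configurations. The abstract does not describe ATF's actual generation algorithm, so the code below is an assumption-laden illustration with hypothetical names (`divisors`, `constrained_space`, `tile`, `wg`).

```python
def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def constrained_space(max_tile):
    """Yield only (tile, wg) pairs in which wg divides tile.

    The range of wg depends on the value already chosen for tile, so invalid
    pairs are never generated in the first place.
    """
    for tile in range(1, max_tile + 1):
        for wg in divisors(tile):
            yield tile, wg


space = list(constrained_space(128))
```

Here `len(space)` is far smaller than the 128 * 128 pairs of the unconstrained space, and every element satisfies the constraint by construction, so a search technique never has to evaluate or penalize an invalid configuration.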

Computer architecture and high performance computing

Goldman

Arantes

Moreno

2017

Concurrency and Computation

Self Cite

This special issue of Concurrency and Computation: Practice and Experience gathers eleven selected research articles that were previously presented at the Brazilian "XVII Simpósio em Sistemas Computacionais de Alto Desempenho," WSCAD 2016, held in conjunction with 28th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2015, Florianópolis, SC, Brazil, from the 19th to the 21st October 2015. Since 2000, this workshop has presented important and interesting research in the fields of computer architectures, high performance computing, and distributed systems.

The scope of the current special issue is broad and representative of the multidisciplinary nature of high performance and distributed computing, covering a wide range of subjects such as architecture issues, compiler optimization, analysis of HPC applications, job scheduling, and energy efficiency.

The title of the first paper is "An efficient virtual system clock for the wireless Raspberry Pi computer platform," by Diego L. C. Dutra, Edilson C. Corrêa, and Claudio L. Amorim [1]. In this paper, the authors present the design and experimental evaluation of an implementation of the RVEC virtual system clock in the Linux kernel for the EE (Energy-Efficient) Wireless Raspberry Pi (RasPi) platform. In the RasPi platform, the use of DVFS (Dynamic Voltage and Frequency Scaling) for reducing the energy consumption hinders the direct use of the cycle count of the ARM11 processor core for building an efficient system clock. Therefore, a distinct feature of RVEC is to obviate this obstacle, such that it can make use of the cycle count circuit for precise and accurate time measurements, concurrently with the use of DVFS by the operating system of the ARM11 processor core.

In the second contribution, entitled "Portability with efficiency of the advection of BRAMS between multi-core and many-core architectures," the authors, Manoel Baptista Silva Junior, Jairo Panetta, and Stephan Stephany [2], show the feasibility of writing a single portable code embedding both interfaces (the OpenMP programming interface and OpenACC). It presents acceptable efficiency when executed on nodes with multi-core or many-core architecture. The code chosen as a case study is the advection of scalars, a part of the dynamics of the regional atmospheric model Brazilian Regional Atmospheric Modeling System (BRAMS). The dynamics of this model is hard to parallelize due to data dependencies between adjacent grid points. Single-node executions of the advections of scalars for different grid sizes using OpenMP or OpenACC yielded similar speed-ups, showing the feasibility of the proposed approach.

In the third contribution, entitled "SMT-based context-bounded model checking for CUDA programs," the authors (Phillipe Pereira, Higo Albuquerque, Isabela da Silva, Hendrio Marques, Felipe Monteiro, Ricardo Ferreira, and Lucas Cordeiro) [3] present the ESBMC-GPU tool, an extension to the Efficient SMT-Based Context-Bounded Model Checker (ESBMC), which is aimed at verifying Graphics ...

show abstract