2012 Innovative Parallel Computing (InPar)
DOI: 10.1109/inpar.2012.6339595

Auto-tuning a high-level language targeted to GPU codes

Abstract: Determining the best set of optimizations to apply to a kernel to be executed on the graphics processing unit (GPU) is a challenging problem. There are large sets of possible optimization configurations that can be applied, and many applications have multiple kernels. Each kernel may require a specific configuration to achieve the best performance, and moving an application to new hardware often requires a new optimization configuration for each kernel. In this work, we apply optimizations to GPU code using HMP…
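The abstract describes searching a per-kernel space of optimization configurations and keeping the one that performs best. A minimal sketch of such an exhaustive auto-tuner is shown below; the parameter names (`BLOCK_SIZES`, `UNROLL_FACTORS`) and the timed stand-in kernel are illustrative assumptions, not the paper's actual HMPP search space.

```python
import itertools
import time

# Hypothetical tuning parameters -- a real tuner would enumerate the
# directive settings exposed by the compiler for each kernel.
BLOCK_SIZES = [64, 128, 256]
UNROLL_FACTORS = [1, 2, 4]

def run_kernel(block_size, unroll, n=50_000):
    """Stand-in for launching a GPU kernel under one configuration.
    A CPU loop emulates work whose cost varies with the config."""
    step = max(1, (block_size * unroll) // 64)
    total = 0
    for i in range(0, n, step):
        total += i
    return total

def autotune():
    """Time every configuration in the cross product; keep the fastest."""
    best_cfg, best_time = None, float("inf")
    for block, unroll in itertools.product(BLOCK_SIZES, UNROLL_FACTORS):
        start = time.perf_counter()
        run_kernel(block, unroll)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_cfg, best_time = (block, unroll), elapsed
    return best_cfg, best_time

best_cfg, best_time = autotune()
print("best configuration:", best_cfg)
```

Because the best configuration is chosen empirically per kernel, rerunning the same search on new hardware yields a new configuration automatically, which is the portability argument the abstract makes.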


Cited by 355 publications (160 citation statements)
References 28 publications (16 reference statements)
“…We use the GPU version of the popular Polybench benchmark suite [10]. This suite contains data-parallel applications written in OpenCL.…”
Section: Benchmark Applications (mentioning)
confidence: 99%
“…Previous studies, like [8,24,13,17,32,14,22,10] also evaluate directive-based compilers that generate code for accelerators. The main difference is that this work covers more programs and includes a study of transformations.…”
Section: Related Work (mentioning)
confidence: 99%
“…While we perform our evaluation using the Rodinia benchmark suite, which contains applications from different domains, most previous works experiment with only one or two applications, the exceptions being the project discussed by Grauer et al. [10] and the work of Lee and Vetter [22]. The work by Grauer et al. uses the PolyBench collection, which contains regular kernels mostly from the linear algebra domain.…”
Section: Related Work (mentioning)
confidence: 99%
“…Each case calls the portability of accelerator performance into question. The PolyBench/GPU project [46] attempts to address these issues through auto-tuning, establishing the lack of native performance portability in the process. Our work also attempts to address this issue by dynamically assigning appropriate amounts of work regardless of device performance.…”
Section: Early Hardware Asymmetry (mentioning)
confidence: 99%