A Graphics Processing Unit (GPU) is a parallel computing coprocessor specialized in accelerating vector operations. The enormous heterogeneity of parallel computing platforms justifies and motivates the development of automated optimization tools and techniques. The Algorithm Selection Problem consists in finding a combination of algorithms, or a configuration of an algorithm, that optimizes the solution of a set of problem instances. An autotuner solves the Algorithm Selection Problem using search and optimization techniques. In this paper, we implement an autotuner for the Compute Unified Device Architecture (CUDA) compiler's parameters using the OpenTuner framework. The autotuner searches for a set of compilation parameters that minimizes a program's run time. We analyze the speedups achieved, in comparison with the compiler's high-level optimizations, on three different GPU devices for 17 heterogeneous GPU applications, 12 of which come from the Rodinia Benchmark Suite. The autotuner often beats the compiler's high-level optimizations but underperformed on some problems. We achieved over 2x speedup for Gaussian Elimination and almost 2x speedup for Heart Wall, both from the Rodinia Benchmark Suite, and over 4x speedup for a matrix multiplication algorithm.

An instance of the Algorithm Selection Problem defines a search space. Various optimization techniques search this space, guided by performance metrics, for the algorithm or configuration that best solves the problem.

There are specialized autotuners for domains such as matrix multiplication [2], dense [3] or sparse [4] matrix linear algebra, and parallel programming [5]. Other autotuning frameworks provide more general tools for representing and searching program configurations, enabling the implementation of autotuners for different problem domains [6,7]. The OpenTuner framework [6] provides tools for implementing autotuners for various problem domains.
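The notion of a search space over compiler configurations can be illustrated with a minimal, self-contained sketch. The flag values below are real nvcc options, but the cost function is purely synthetic (a real autotuner compiles and times the program instead), and the space is kept tiny so plain enumeration suffices; dedicated search techniques matter precisely because realistic compiler parameter spaces are far too large to enumerate.

```python
import itertools

# A tiny, hypothetical search space of CUDA compiler options -- a stand-in
# for the much larger real parameter space explored by an autotuner.
SEARCH_SPACE = {
    "opt_level": ["-O0", "-O1", "-O2", "-O3"],
    "fast_math": ["", "--use_fast_math"],
    "ftz":       ["--ftz=true", "--ftz=false"],  # only here to enlarge the space
}

def measure(config):
    """Synthetic cost model standing in for 'compile and time the program'.

    Purely illustrative: pretends higher -O levels and fast math are
    cheaper. A real measurement function would run the compiled binary.
    """
    cost = 10.0 - SEARCH_SPACE["opt_level"].index(config["opt_level"])
    if config["fast_math"]:
        cost -= 2.0
    return cost

def exhaustive_search():
    """Enumerate every configuration and keep the fastest one."""
    best_cfg, best_time = None, float("inf")
    keys = list(SEARCH_SPACE)
    for values in itertools.product(*(SEARCH_SPACE[k] for k in keys)):
        cfg = dict(zip(keys, values))
        t = measure(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

best_cfg, best_time = exhaustive_search()
print(best_cfg["opt_level"], best_time)  # best_time is 5.0 under this model
```

With four optimization levels, two math modes, and two flush-to-zero settings, this space already has 16 points; real compiler flag spaces grow combinatorially, which is why frameworks such as OpenTuner rely on ensembles of search techniques rather than enumeration.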
It implements different search techniques that explore the same search space of program optimizations. Running and measuring program execution time, that is, the empirical exploration of the search space, is done sequentially. The framework also provides support for parallel compilation.

In this paper, we implemented an autotuner for the CUDA compiler using the OpenTuner framework [6] and used it to search for the compilation parameters that optimize the performance of 17 heterogeneous GPU applications, 12 of which are from the Rodinia Benchmark Suite [8]. We used three different NVIDIA GPUs in the experiments: the Tesla K40, the GTX 980, and the GTX 750.

Our main contribution is to show that it is possible to optimize code written for GPUs by automatically tuning just the parameters of the CUDA compiler. We propose a thorough methodology for verifying result correctness and detecting invalid flag combinations and compilation errors. The optimization achieved by autotuning often beats the compiler's high-level optimization options, such as -O1, -O2, and -O3. The autotuner found compilation options that achieved...
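The measurement step of such a methodology can be sketched as follows. This is our own illustration, not OpenTuner's API: the function name and the penalty-on-failure convention are assumptions. The idea is that invalid flag combinations surface as compilation failures, runaway configurations hit a timeout, and wrong answers are caught by an output check; all three are reported as an infinite cost so the search simply discards those configurations.

```python
import math
import subprocess
import sys
import time

def evaluate(compile_cmd, run_cmd, check_output, timeout=60):
    """Compile and run one candidate configuration.

    Returns the measured run time in seconds, or math.inf when the
    configuration is invalid: compilation fails (e.g. an illegal flag
    combination), execution fails or times out, or the output fails
    the correctness check.
    """
    try:
        subprocess.run(compile_cmd, check=True, capture_output=True,
                       timeout=timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return math.inf  # invalid flags or compilation error
    try:
        start = time.perf_counter()
        result = subprocess.run(run_cmd, check=True, capture_output=True,
                                timeout=timeout)
        elapsed = time.perf_counter() - start
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return math.inf  # runtime failure or timeout
    if not check_output(result.stdout):
        return math.inf  # wrong answer: reject this configuration
    return elapsed

# Demonstration with stand-in commands (no CUDA toolchain required);
# a real tuner would invoke nvcc here with the candidate's flags.
t = evaluate(
    compile_cmd=[sys.executable, "-c", "pass"],    # pretend "compile"
    run_cmd=[sys.executable, "-c", "print(42)"],   # pretend "run"
    check_output=lambda out: out.strip() == b"42",
)
print(t != math.inf)  # True: valid configuration, finite measured time
```

Reporting failures as an infinite cost keeps the search loop uniform: the optimizer minimizes one scalar, and invalid or incorrect configurations are never preferred over any configuration that produced a correct result.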