Learning to optimize Halide with tree search and random programs

Adams, Andrew; Ma, Karima; Anderson, Luke; Baghdadi, Riyadh; Li, Tzu‐Mao; Gharbi, Michaël; Steiner, Benoit; Johnson, Steven D.; Fatahalian, Kayvon; Durand, Frédo; Ragan-Kelley, Jonathan

doi:10.1145/3306346.3322967

Cited by 164 publications

(179 citation statements)

References 19 publications

(26 reference statements)

“…We use a 3.4 GHz, quad-core Intel i5-4670 CPU with 16GB RAM and two GPUs (each experiment uses a single GPU): an NVIDIA GTX 1080Ti and an NVIDIA Tesla V100 (Table 1 lists their key specifications). For our benchmarks, we use six canonical image processing applications that have appeared in prior work [6,14,18,19,22]. Table 2 reports the number of stages and the size of the input image for each benchmark.…”

Section: Discussion (mentioning)

confidence: 99%

“…The final problem involves choosing tile and block sizes. We present an automatic fusion algorithm that considers key factors affecting the performance of GPU kernels which are not considered in previous work [6,17,18]: 1) number of global memory transactions, 2) achieved and theoretical occupancy, 3) GPU resource usage, and 4) fraction of overlapping computations.…”

Section: Dynamic Programming Fusion (mentioning)

confidence: 99%

“…These DSLs allow the programmer to write independent stages in a natural way, but still get high-performance code by applying key optimizations, including loop fusion and overlapped tiling. Loop fusion allows the program to exploit locality, and is performed on the basis of a schedule that is either specified by an expert [22,23] or automatically generated using heuristics [6,14,17,18]. After loop fusion, overlapped tiling [19,22,23] splits each stage into overlapping regions (known as tiles) that can be processed in parallel without synchronization with other tiles.…”

Section: Introduction (mentioning)

confidence: 99%
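
The loop fusion and overlapped tiling described in the statement above can be illustrated with a minimal sketch. It assumes a one-dimensional, two-stage 3-point blur; the function names (`blur3`, `pipeline_unfused`, `pipeline_overlapped_tiles`) and the `tile` parameter are hypothetical and are not taken from Halide, PolyMage, or the cited paper. Each tile reads its input region widened by a halo, recomputes the first stage locally, and so needs no synchronization with neighbouring tiles.

```python
def blur3(xs):
    """3-point box filter over the interior; output is 2 elements shorter."""
    return [(xs[i - 1] + xs[i] + xs[i + 1]) / 3.0 for i in range(1, len(xs) - 1)]


def pipeline_unfused(image):
    """Reference: run both stencil stages over the whole image in turn."""
    return blur3(blur3(image))


def pipeline_overlapped_tiles(image, tile=4):
    """Fuse both stages per tile; adjacent tiles redundantly recompute the halo."""
    halo = 2  # two 3-point stages, each needing 1 extra input on both sides
    out_len = len(image) - 2 * halo
    out = []
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        # Inputs needed for outputs [start, stop): widened by the halo on each side.
        region = image[start : stop + 2 * halo]
        # Both stages run on the tile-local region, independently of other tiles.
        out.extend(blur3(blur3(region)))
    return out


image = [float(i * i % 7) for i in range(20)]
tiled = pipeline_overlapped_tiles(image, tile=4)
```

In this sketch a smaller `tile` means a larger share of the first stage is recomputed in the overlapping halos, while a larger `tile` means each tile holds a larger intermediate buffer, which mirrors the tile-size trade-off the citing paper describes.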

Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs

Jangda

Guha

2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Domain-specific languages that execute image processing pipelines on GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have limitations: 1) they require intra thread block synchronization, which has a nontrivial cost, 2) they must choose between small tiles that require more overlapped computations or large tiles that increase shared memory access (and lowers occupancy), and 3) their autoscheduling algorithms use simplified GPU models that can result in inefficient global memory accesses. We present a new approach for executing image processing pipelines on GPUs that addresses these limitations as follows. 1) We fuse loops to form overlapped tiles that fit in a single warp, which allows us to use lightweight warp synchronization. 2) We introduce hybrid tiling, which stores overlapped regions in a combination of thread-local registers and shared memory. Thus hybrid tiling either increases occupancy by decreasing shared memory usage or decreases overlapping computations using larger tiles. 3) We present an automatic loop fusion algorithm that considers several factors that affect the performance of GPU kernels. We implement these techniques in PolyMage-GPU, which is a new GPU backend for PolyMage. Our approach produces code that is faster than Halide's manual schedules: 1.65× faster on an NVIDIA GTX 1080Ti and 1.33× faster on an NVIDIA Tesla V100.

“…Auto-Tuning approaches including Halide's auto-tuners [1,23], OpenTuner [2], ATF [24], and program synthesis techniques such as SwizzleInventor [20] aim to automatically develop optimized code using design space exploration. We aim to automatically synthesize Fireiron strategies in the future but in its current version it is designed as a tool for human performance experts.…”

Section: Related Work (mentioning)

confidence: 99%

Fireiron

Hagedorn

Elliott

Barthels

et al. 2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

High GPU performance can only be achieved if a kernel efficiently uses the multi-layered compute and memory hierarchies. For example, accelerators such as NVIDIA's Tensor Cores require specific mappings of threads to data that must be considered in data movements to and from registers. Current compilers struggle to match the performance of vendor libraries like cuBLAS, which are developed by experts in assembly. This manual low-level coding is time-consuming and complicates to unlock the full GPU potential, preventing experimentation to achieve even higher performance. In this paper we introduce Fireiron, a scheduling language aimed at performance experts. Fireiron provides high-level abstractions for expressing GPU optimizations that are unavailable to compilers today and which so far must be written in assembly. Our innovation is that both computations and data movements are first class concepts that can be separately mapped to threads, as required for the efficient use of specialized hardware like Tensor Cores. We evaluate Fireiron on three GPU architectures against expert-written advanced matrix multiplications. First, we show that Fireiron schedules are able to express the strategies of these implementations requiring about 6× less lines of code. Second, we show that the code generated by Fireiron schedules outperforms the fastest implementations (provided by cuBLAS) by more than 2×.

“…Recent research shows growing interest in automatic whole-program optimization techniques [18][19][20], but approaches are preliminary and typically focus on optimizing only one aspect of a program at a time. There is no doubt that multi-dimensional whole program optimization is a hard task, but we can perhaps take some hope from the recent success of hybrid search/learning approaches such as AlphaGo [21] that show promise in finding good solutions within huge combinatorial search spaces.…”

Section: Manual Vs Automatic Search Strategies (mentioning)

confidence: 99%

Machine Learning Systems are Stuck in a Rut

Barham

Isard

2019

Proceedings of the Workshop on Hot Topics in Operating Systems

In this paper we argue that systems for numerical computing are stuck in a local basin of performance and programmability. Systems researchers are doing an excellent job improving the performance of 5-year-old benchmarks, but gradually making it harder to explore innovative machine learning research ideas. We explain how the evolution of hardware accelerators favors compiler back ends that hyper-optimize large monolithic kernels, show how this reliance on high-performance but inflexible kernels reinforces the dominant style of programming model, and argue these programming abstractions lack expressiveness, maintainability, and modularity; all of which hinders research progress. We conclude by noting promising directions in the field, and advocate steps to advance progress towards high-performance general purpose numerical computing systems on modern accelerators.
