2021
DOI: 10.48550/arxiv.2105.03215
Preprint

Bring Your Own Codegen to Deep Learning Compiler

Zhi Chen,
Cody Hao Yu,
Trevor Morris
et al.

Abstract: Deep neural networks (DNNs) have been ubiquitously applied in many applications, and accelerators have emerged as an enabler for fast and efficient inference in these applications. However, to achieve high model coverage with high performance, each accelerator vendor has to develop a full compiler stack to ingest, optimize, and execute the DNNs. This poses significant challenges in the development and maintenance of the software stack. In addition, the vendors have to continuously update their …

Cited by 3 publications (5 citation statements)
References 17 publications
“…Genesis [26] is a DL compiler that integrates graph partitioning functionalities into TVM. Genesis has a similar structure to NEST-C.…”
Section: Compilers (mentioning)
confidence: 99%
“…Once flexible matching completes, the extracted rewritten program is translated back to Relay where accelerator instructions are specially annotated. In our prototype, we use TVM's Bring Your Own Codegen (BYOC) interface to implement the generation of those accelerator instructions [16]. BYOC allows for invoking the target interface of a custom execution mechanism (e.g., an accelerator's MMIO loads/stores) by having TVM's runtime defer execution to a user-specified runtime when it reaches an annotated portion of the program.…”
Section: Prototype Implementation (mentioning)
confidence: 99%
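The annotate-and-defer mechanism described in that citation can be sketched with TVM's public Relay passes. The example below is a minimal, illustrative flow, not code from the cited papers: it uses the in-tree "dnnl" contrib backend as a stand-in for a vendor codegen (building it requires a TVM installation compiled with that backend enabled), and the toy conv2d+relu model is chosen only for brevity.

```python
# Minimal, illustrative BYOC flow using TVM's public Relay passes.
# "dnnl" is an in-tree example backend standing in for a vendor codegen.
import tvm
from tvm import relay

# Toy model: conv2d -> relu.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
out = relay.nn.relu(relay.nn.conv2d(data, weight, padding=(1, 1)))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

# Annotate ops the external backend supports, merge adjacent supported
# regions, and split them into separate functions carrying a "Compiler"
# attribute.
mod = relay.transform.AnnotateTarget("dnnl")(mod)
mod = relay.transform.MergeCompilerRegions()(mod)
mod = relay.transform.PartitionGraph()(mod)

# At build time TVM hands the partitioned functions to the matching
# codegen; at run time the graph executor defers those calls to the
# backend's runtime module, as the citation above describes.
lib = relay.build(mod, target="llvm")
```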
“…In principle, many DSLs allow for supporting custom accelerators via bespoke translations from DSL operators to specific accelerator APIs, e.g., as in the original TVM [14] support for VTA [50]. TVM's BYOC [16] interface eases incorporating custom accelerators by performing syntactic pattern matching to offload computations via user-provided code generators. However, BYOC leaves all matters of code generation, e.g., MMIO invocations, to the user, while D2A provides more structure to code generation via the ILA.…”
Section: Pattern Matching Accelerator Calls (mentioning)
confidence: 99%
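The "syntactic pattern matching" this citation refers to is driven by a user-provided pattern table. The sketch below uses TVM's dataflow-pattern API to declare a conv2d+relu sequence as offloadable; the backend name "my_accel" and the composite name are hypothetical placeholders, and the pipeline that consumes the table is noted in the trailing comment.

```python
# Hedged sketch of BYOC's pattern-table mechanism with a hypothetical
# backend called "my_accel". The dataflow-pattern API itself is TVM's.
from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard
from tvm.relay.op.contrib.register import register_pattern_table


def make_conv_relu_pattern():
    # Match conv2d followed by relu so the pair is offloaded as one unit.
    conv = is_op("nn.conv2d")(wildcard(), wildcard())
    return is_op("nn.relu")(conv)


@register_pattern_table("my_accel")
def my_accel_patterns():
    # (composite name, pattern) pairs consumed by MergeComposite.
    return [("my_accel.conv2d_relu", make_conv_relu_pattern())]


# In the partitioning pipeline, matched subgraphs become composite
# functions that the "my_accel" codegen later lowers to accelerator calls:
#   mod = relay.transform.MergeComposite(my_accel_patterns())(mod)
#   mod = relay.transform.AnnotateTarget("my_accel")(mod)
#   mod = relay.transform.PartitionGraph()(mod)
```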
“…One naïve solution is to develop a full compiler stack from scratch for each hardware, but this does not scale. Bolt addresses this challenge by employing a BYOC (Bring Your Own Codegen) (Chen et al, 2021) approach. It enables us to reuse the existing compiler stacks (e.g., TVM) as much as possible and focus only on the optimization and code generation using templated device libraries.…”
Section: Challenges In Code Generation (mentioning)
confidence: 99%
“…Traditional BYOC systems (Chen et al, 2021) cannot target code generation in templated format; they treat such libraries as external functions at runtime. In contrast, Bolt produces low-level tensor implementations in the CUTLASS convention by instantiating the templates with the best parameters identified by the profiler.…”
Section: Templated Code Generation (mentioning)
confidence: 99%
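As a rough illustration of what "instantiating the templates with the best parameters" can mean in practice (this is not Bolt's actual implementation), a codegen may fill a C++ source template, here a CUTLASS-style GEMM declaration, with tile sizes chosen by a profiler rather than calling a fixed external function. All parameter names and values below are hypothetical.

```python
# Hedged illustration (not Bolt's code): emitting a templated CUTLASS-style
# GEMM declaration by substituting profiler-chosen parameters into a C++
# source template.
CUTLASS_GEMM_TEMPLATE = """
using Gemm = cutlass::gemm::device::Gemm<
    {dtype}, cutlass::layout::RowMajor,
    {dtype}, cutlass::layout::ColumnMajor,
    {dtype}, cutlass::layout::RowMajor,
    {accum_dtype},
    cutlass::arch::OpClassTensorOp, cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<{tb_m}, {tb_n}, {tb_k}>,      // threadblock tile
    cutlass::gemm::GemmShape<{warp_m}, {warp_n}, {warp_k}>  // warp tile
>;
"""


def emit_gemm(best_params: dict) -> str:
    """Instantiate the template with the tile sizes the profiler picked."""
    return CUTLASS_GEMM_TEMPLATE.format(**best_params)


# Example: parameters a profiler might select for one GEMM workload.
print(emit_gemm({
    "dtype": "cutlass::half_t", "accum_dtype": "float",
    "tb_m": 128, "tb_n": 128, "tb_k": 32,
    "warp_m": 64, "warp_n": 64, "warp_k": 32,
}))
```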