2021
DOI: 10.1109/tpds.2020.3030548
The Deep Learning Compiler: A Comprehensive Survey

Cited by 122 publications (52 citation statements) · References 33 publications
“…The evaluated memory sizes were 512 KiB and 256 MiB. The latter configuration requires no tiling, while the former is the smallest size supported by the implemented tiling methods for this network 26 .…”
Section: Discussion
confidence: 99%
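The memory thresholds quoted above come down to a simple working-set check: a layer is tiled only when its buffers exceed the memory budget. A minimal sketch of that decision, with illustrative sizes not taken from the cited paper:

```python
# Hypothetical sketch (names and sizes are illustrative, not from the survey):
# decide whether a layer's buffers fit in the memory budget or must be tiled.

def needs_tiling(input_bytes, weight_bytes, output_bytes, mem_bytes):
    """Return True if the layer's working set exceeds the memory budget."""
    return input_bytes + weight_bytes + output_bytes > mem_bytes

KIB, MIB = 1024, 1024 * 1024

# A layer with a ~1 MiB working set fits in the 256 MiB configuration but
# not in the 512 KiB one, mirroring the two sizes evaluated above.
working_set = (256 * KIB, 512 * KIB, 256 * KIB)  # input, weights, output
print(needs_tiling(*working_set, 512 * KIB))  # exceeds budget: tile
print(needs_tiling(*working_set, 256 * MIB))  # fits: no tiling needed
```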
“…It is currently not possible to connect the BYOC flow with the micro-TVM runtime, which is also still under development. This prevents the usage of TVM on (heterogeneous) embedded devices for TinyML applications; however, it can already be utilized during hardware development to evaluate the performance of prototypes with real-world test cases.…”
[Footnote 26: For convolutional layers, only a split along the output-channel dimension was implemented, as splitting along the rows and columns requires extensive effort to implement and validate all the edge cases that can occur.]
Section: Discussion
confidence: 99%
“…This allows better utilization (faster execution, lower energy consumption) of the target hardware. A detailed survey of the work is presented in [9].…”
Section: Related Work
confidence: 99%
“…A compiler takes the DL models from DL frameworks (e.g., TensorFlow [1], MXNet [5], PyTorch [22]) as input. It converts the model into multiple levels of intermediate representations (IRs) and then automatically applies various performance optimizations, based on the model's characteristics and the underlying hardware, to generate high-performance model code [19]. Although different compilers adopt different design philosophies, the fundamental procedures for generating efficient model code are similar.…”
Section: Introduction
confidence: 99%
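The pipeline described above (framework graph → multi-level IR → optimization passes → code generation) can be illustrated with a toy sketch. This is a hypothetical stand-in, not any real compiler's API; the pass shown is the classic conv+ReLU operator fusion:

```python
# Hypothetical toy pipeline: a model arrives as a flat list of high-level
# ops, an optimization pass fuses adjacent conv2d -> relu pairs, and a
# "codegen" step emits one pseudo-instruction per (possibly fused) op.

def fuse_conv_relu(ops):
    """Pattern-match adjacent conv2d -> relu pairs into one fused op,
    avoiding a round trip to memory between the two kernels."""
    fused, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == "conv2d" and ops[i + 1] == "relu":
            fused.append("conv2d_relu")  # one kernel instead of two
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused

def codegen(ops):
    """Lower the optimized IR to pseudo-instructions for a target."""
    return [f"CALL {op}" for op in ops]

graph = ["conv2d", "relu", "maxpool", "conv2d", "relu", "dense"]
lowered = fuse_conv_relu(graph)  # high-level IR -> optimized IR
print(lowered)   # ['conv2d_relu', 'maxpool', 'conv2d_relu', 'dense']
print(codegen(lowered))
```

Real compilers such as TVM perform this kind of rewriting on graph- and tensor-level IRs with far richer pattern matching, but the structure (IR in, optimized IR out, then codegen) is the same.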