Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
DOI: 10.1145/3373087.3375321

End-to-End Optimization of Deep Learning Applications

Abstract: The irregularity of recent Convolutional Neural Network (CNN) models, such as the reduced data reuse and parallelism caused by extensive network pruning and simplification, creates new challenges for FPGA acceleration. Furthermore, without proper optimization, there can be significant overheads when integrating FPGAs into existing machine learning frameworks like TensorFlow. This problem has been mostly overlooked by previous studies. However, our study shows that a naive FPGA integration into TensorFlow could lead to u…

Cited by 37 publications (13 citation statements)
References 16 publications
“…However, they assume that the performance/area changes monotonically by modifying an individual design parameter, which is not a valid assumption, as we explained in Challenge 2 of Section 1. To increase the accuracy of the estimation model, a number of other studies restrict the target application to those that have a well-defined accelerator micro-architecture template [9,14,15,40,45,58], a specific application [55,61], or a particular computation pattern [10,28,37]; hence, they lose generality.…”
Section: Model-based Techniques
confidence: 99%
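
To make the non-monotonicity point concrete, here is a toy analytical model (our illustration, not taken from the cited work; all budgets and coefficients are assumptions): latency improves with the loop-unroll factor only until the unrolled datapath exhausts the assumed DSP budget, after which a frequency penalty outweighs the added parallelism.

/* Hedged toy model (not from the cited work): latency is not monotonic in a
 * single design parameter such as the unroll factor. Once the datapath
 * exceeds the DSP budget, the assumed achievable clock frequency degrades,
 * so latency worsens again. All numbers are made-up assumptions. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const int trip_count = 1024;   /* iterations of the loop being unrolled */
    const int dsp_budget = 512;    /* available DSP slices (assumption) */
    for (int unroll = 1; unroll <= 256; unroll *= 2) {
        int dsps = 4 * unroll;     /* 4 DSPs per parallel MAC (assumption) */
        double fmax = (dsps <= dsp_budget)
                      ? 300.0      /* timing closes comfortably */
                      : 300.0 * pow((double)dsp_budget / dsps, 2.0); /* congestion penalty */
        double cycles = (double)trip_count / unroll;
        double latency_us = cycles / fmax;   /* cycles / MHz = microseconds */
        printf("unroll=%3d  Fmax=%6.1f MHz  latency=%7.4f us\n",
               unroll, fmax, latency_us);
    }
    return 0;
}

Under these assumptions, latency falls from unroll=1 through unroll=128 but rises again at unroll=256, so a model that assumes monotonic improvement would mispredict the optimum.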
“…The main enabler of this feature is the ability to iteratively re-optimize the micro-architecture quickly just by inserting synthesis directives in the form of pragmas, instead of rewriting the low-level behavioral description of the design. Because of the reduced code development cycle and the shorter turn-around times, HLS has been rapidly adopted by both academia and industry [3,20,30,45,49,65]. In fact, Code 1 shows an intuitive HLS C implementation of one forward path of a Convolutional Neural Network (CNN) on Xilinx FPGAs.…”
Section: Introduction
confidence: 99%
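
Code 1 from the citing paper is not reproduced in this report. As a rough sketch of the kind of pragma-annotated HLS C it describes, the naive convolution loop nest below uses the standard Vivado HLS PIPELINE, UNROLL, and ARRAY_PARTITION directives; the layer dimensions and all names are our assumptions, not the paper's code.

/* Hedged sketch -- NOT the cited paper's Code 1. A naive HLS C convolution
 * layer whose micro-architecture is tuned purely through pragmas, the
 * directive-only flow the statement refers to. Sizes are assumptions. */
#define IN_CH  8
#define OUT_CH 8
#define IMG_H  16
#define IMG_W  16
#define K      3

void conv_layer(const float in[IN_CH][IMG_H + K - 1][IMG_W + K - 1],
                const float w[OUT_CH][IN_CH][K][K],
                float out[OUT_CH][IMG_H][IMG_W]) {
#pragma HLS ARRAY_PARTITION variable=w complete dim=4
    for (int oc = 0; oc < OUT_CH; oc++)
        for (int h = 0; h < IMG_H; h++)
            for (int x = 0; x < IMG_W; x++) {
                float acc = 0.0f;
                for (int ic = 0; ic < IN_CH; ic++) {
/* Re-optimizing means editing directives like the ones below, not the loops. */
#pragma HLS PIPELINE II=1
                    for (int kh = 0; kh < K; kh++)
#pragma HLS UNROLL
                        for (int kw = 0; kw < K; kw++)
#pragma HLS UNROLL
                            acc += in[ic][h + kh][x + kw] * w[oc][ic][kh][kw];
                }
                out[oc][h][x] = acc;
            }
}

The floating-point accumulation across pipelined iterations would keep the achieved II above 1; that is exactly the kind of inefficiency an "intuitive" first version exhibits before directive-level re-optimization.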
“…Caffeine [39] combined both on-chip and off-chip data reorganizations for the convolutional matrix-multiplication representation to maximize the underlying memory bandwidth utilization. FlexCNN [26] further optimized the data layout for the concatenation layers. However, all these works [6]–[26] are based on the computation and memory access patterns of the inference phase, which only has FP.…”
Section: Related Work
confidence: 99%
“…FlexCNN [26] further optimized the data layout for the concatenation layers. However, all these works [6]–[26] are based on the computation and memory access patterns of the inference phase, which only has FP. The training phase involves FP, BP, and WU, whose data access patterns for output features, input features, and weights are different.…”
Section: Related Work
confidence: 99%
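
To illustrate why one data layout cannot serve all three training phases, the sketch below (ours, not from the cited work) writes out the three kernels for a small fully connected layer: forward propagation (FP) reads the weights row-wise, backpropagation (BP) reads the same weights column-wise, and the weight update (WU) accumulates into them.

/* Hedged sketch (not from the cited work): the same weight array is accessed
 * three different ways across FP, BP, and WU, so a layout tuned for
 * inference (FP only) is not enough for training. */
#include <stdio.h>

#define NIN  4   /* input features  */
#define NOUT 3   /* output features */

/* FP: y = W * x; W is streamed row-major. */
void fp(const float W[NOUT][NIN], const float x[NIN], float y[NOUT]) {
    for (int o = 0; o < NOUT; o++) {
        y[o] = 0.0f;
        for (int i = 0; i < NIN; i++)
            y[o] += W[o][i] * x[i];
    }
}

/* BP: dx = W^T * dy; the same W is now read column-major (transposed). */
void bp(const float W[NOUT][NIN], const float dy[NOUT], float dx[NIN]) {
    for (int i = 0; i < NIN; i++) {
        dx[i] = 0.0f;
        for (int o = 0; o < NOUT; o++)
            dx[i] += W[o][i] * dy[o];
    }
}

/* WU: dW += dy * x^T; W-shaped data is written/accumulated, not read. */
void wu(float dW[NOUT][NIN], const float dy[NOUT], const float x[NIN]) {
    for (int o = 0; o < NOUT; o++)
        for (int i = 0; i < NIN; i++)
            dW[o][i] += dy[o] * x[i];
}

int main(void) {
    float W[NOUT][NIN] = {{0.5f}};   /* W[0][0] = 0.5, rest zero */
    float dW[NOUT][NIN] = {{0}};
    float x[NIN] = {1, 2, 3, 4}, dy[NOUT] = {0.1f, 0.2f, 0.3f};
    float y[NOUT], dx[NIN];
    fp(W, x, y);
    bp(W, dy, dx);
    wu(dW, dy, x);
    printf("y[0]=%.2f  dx[0]=%.2f  dW[0][0]=%.2f\n", y[0], dx[0], dW[0][0]);
    return 0;
}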
“…In contrast, FINN uses a High-Level Synthesis (HLS) library [20] of hardware layers and components that are used to generate streaming architectures customized for each network. Other tools for automatic hardware generation are FlexCNN [21], which integrates an FPGA implementation framework into TensorFlow, and DNNBuilder [22], which uses software-hardware co-design to perform an end-to-end optimization of deep learning applications.…”
Section: B Automatic Hardware Generation and Hardware Architectures F...
confidence: 99%