Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 2022
DOI: 10.1145/3503222.3507767

A full-stack search technique for domain optimized deep learning accelerators

Abstract: The rapidly-changing deep learning landscape presents a unique opportunity for building inference accelerators optimized for specific datacenter-scale workloads. We propose Full-stack Accelerator Search Technique (FAST), a hardware accelerator search framework that defines a broad optimization environment covering key design decisions within the hardware-software stack, including hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding. In this paper, we analyze b…
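The abstract describes a joint search over hardware datapath parameters, software scheduling, and compiler passes such as operation fusion and tensor padding, optimized for performance per TDP. Below is a minimal illustrative sketch of such a joint search loop in Python; the parameter names, value ranges, and the placeholder cost model are hypothetical and are not FAST's actual search space or implementation:

import random

# Hypothetical joint search space spanning hardware and compiler knobs;
# the specific parameters and values are illustrative only.
SEARCH_SPACE = {
    "pe_array": [(64, 64), (128, 128), (256, 128)],   # datapath shape
    "l2_bytes": [2**20, 4 * 2**20, 8 * 2**20],        # on-chip buffer size
    "schedule": ["output_stationary", "weight_stationary"],
    "fuse_ops": [True, False],                         # operation-fusion pass
    "pad_to":   [1, 8, 32],                            # tensor padding multiple
}

def sample_config(rng):
    # Draw one candidate design point from the joint search space.
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def perf_per_tdp(config):
    # Placeholder cost model returning a deterministic pseudo-score per config;
    # a real framework would evaluate an analytical model or simulator on the
    # target workloads instead.
    rng = random.Random(str(sorted(config.items())))
    return rng.uniform(0.1, 1.0)

def random_search(trials=1000, seed=0):
    # Simple random search maximizing the (placeholder) perf/TDP objective.
    rng = random.Random(seed)
    best = max((sample_config(rng) for _ in range(trials)), key=perf_per_tdp)
    return best, perf_per_tdp(best)

if __name__ == "__main__":
    cfg, score = random_search()
    print(f"best perf/TDP score: {score:.3f}")
    print(cfg)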

Cited by 31 publications (18 citation statements)
References 76 publications (103 reference statements)

“…However, both of the works [23], [24] target 16-bit inference, while modern DNN accelerators are mainly using 8-bit precision [1]. Towards the optimization of DNN accelerators, the work in [25] presents a full-stack accelerator search technique which improves the performance per thermal design power ratio. The work in [26] transforms convolutional and fully-connected DNN layers to achieve higher performance in terms of FLOPs/sec.…”
Section: Related Work (mentioning)
confidence: 99%
“…Prior ADL-based Design Methods: Prior NPU design approaches are mostly ADL-based, as in they define an architecture template [2], [32]-[35] in an architecture description language (ADL) [36], and build a system stack around it [3]. For a template, the architecture is fixed, i.e., what kind of computation and memory units are interconnected and how.…”
Section: A. NPU Design Requirements and Challenges (mentioning)
confidence: 99%
“…An architectural template for an NPU specifies what kinds of computational and memory units can be interconnected and how. Various system stack tools for the NPU, such as cost models, simulators, and compilers, are developed manually by experts, limiting support to only the template architecture [1], [3], [4]. As workloads evolve or application requirements become stringent, novel architectural features need to be integrated and explored [5].…”
Section: Introduction (mentioning)
confidence: 99%
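Both excerpts above turn on the notion of a fixed architecture template: the template pins down which compute and memory units exist and how they are interconnected, while a concrete design instance only chooses numeric parameters. The following is a minimal sketch of what such a template might parameterize, written in plain Python rather than a real ADL; all unit names, fields, and default values are hypothetical:

from dataclasses import dataclass, field

@dataclass
class MemoryLevel:
    # One level of the fixed on-chip memory hierarchy.
    name: str
    size_bytes: int
    banks: int

@dataclass
class NPUTemplate:
    # The template fixes which units exist and how they connect;
    # a design point only picks the numeric parameters below.
    mac_rows: int = 128
    mac_cols: int = 128
    vector_lanes: int = 64
    memories: list = field(default_factory=lambda: [
        MemoryLevel("L1_weight", 256 * 1024, banks=8),
        MemoryLevel("L2_shared", 4 * 1024 * 1024, banks=16),
    ])

# One design instance within the (hypothetical) template.
design_point = NPUTemplate(mac_rows=256, mac_cols=128)
print(design_point)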
“…This enables efficient support of group or depth-wise convolution on top of the commonly used channel direction only. Many architectures do not support depth-wise convolution efficiently, resulting in a significant execution time increase [19], [20]. 2) This processor also has a transpose engine and a vector engine with N-dimension indexing to support tensor manipulations and the various vector operations required by deep learning models.…”
Section: Introduction (mentioning)
confidence: 99%
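The inefficiency of depth-wise convolution on many accelerators follows from the mapping: if a MAC array is laid out over input and output channels, a depth-wise layer activates only the "diagonal" of that array, because each output channel reads a single input channel. A back-of-the-envelope utilization estimate under that assumption (array dimensions and mapping are illustrative, not tied to the processor described in the citation):

def mac_array_utilization(cin, cout, depthwise, rows=128, cols=128):
    """Fraction of MACs doing useful work for one layer tile (illustrative)."""
    if depthwise:
        # Each output channel reads only its own input channel, so at most
        # one PE per column (the diagonal of the array) is active.
        active = min(cin, rows, cols)
    else:
        # Dense channel mixing can occupy a full cin x cout tile of PEs.
        active = min(cin, rows) * min(cout, cols)
    return active / (rows * cols)

print(mac_array_utilization(256, 256, depthwise=False))  # 1.0
print(mac_array_utilization(256, 256, depthwise=True))   # 0.0078125 (1/128)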