Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture 2019
DOI: 10.1145/3352460.3358302
Simba

Cited by 232 publications (67 citation statements)
References 40 publications
“…• For networks with a low operation count per layer (e.g., ResNet and AlexNet), the layer fusion provided by our work yields large benefits (compared with the rules-based fusion in TensorRT, where one fusion block contains at most one CONV). • For MobileNet, although it also has a low operation count per layer, it uses depthwise separable convolutions, whose access patterns exhibit low data reuse (Shao et al. 2019; Sandler et al. 2018) compared to ordinary convolutions, and which therefore require manual tuning of code at an even lower level than the provided SDK. So the fine-tuned TensorRT outperforms our work on MobileNet.…”
Section: Discussion
confidence: 99%
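The statement above hinges on the arithmetic structure of depthwise separable convolution, the factorization MobileNet uses (Sandler et al. 2018). A minimal sketch, not taken from any of the cited papers (the function names and layer shape are illustrative assumptions), shows why it trades away data reuse: it replaces one dense multiply-accumulate (MAC) volume with two much smaller ones.

```python
# Hypothetical illustration (not from the cited papers): compare MAC counts
# of an ordinary convolution against a depthwise separable one.

def standard_conv_macs(k, c_in, c_out, h, w):
    """MACs for a k x k standard convolution over an h x w feature map."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_macs(k, c_in, c_out, h, w):
    """MACs for a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1x1 conv mixing channels
    return depthwise + pointwise

# An assumed MobileNet-like layer: 3x3 kernel, 64 -> 64 channels, 32x32 map.
std = standard_conv_macs(3, 64, 64, 32, 32)
sep = depthwise_separable_macs(3, 64, 64, 32, 32)
print(std, sep, round(sep / std, 4))  # reduction factor ~ 1/c_out + 1/k^2
```

The roughly 8x fewer MACs come with far fewer arithmetic operations per byte fetched, which is the low-data-reuse property the citing authors say forces tuning below the SDK level.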
“…There has been an incredible amount of interest in DNN hardware acceleration. Broadly speaking, the architecture community has focused on designing efficient dataflows to maximize local reuse of data and functional unit utilization [4,10,11,15,28,34,37,39], exploring the space of possible dataflows and mappings [26,45,74], exploiting model sparsity and data quantization [17,21,29,38,46,53,71,73,78], mapping DNN accelerators to FPGAs [20,66,69], and exploring alternative compute, memory, and packaging technologies [35,58,59,67]. All of these works are highly relevant to this field.…”
Section: Related Work
confidence: 99%
“…Image classification applications widely use deep convolutional neural networks (CNNs) and are deployed from cloud to edge computational frameworks for a variety of scenarios, such as search engines and self-driving cars [1,2,3,4,5,6]. As the complexity of these applications and the resolution of images continue to increase, conventional homogeneous architectures (such as multi-core CPU/GPU) are constrained by excessively long latency and significant power dissipation [7,8,9]. To efficiently process these applications, heterogeneous architectures have been proposed with pre-processing and inference cores [7,8,9,10,11,12,13].…”
Section: Introduction
confidence: 99%