2022 IEEE International Solid-State Circuits Conference (ISSCC)
DOI: 10.1109/isscc42614.2022.9731686

A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing

Cited by 30 publications (5 citation statements)
References 2 publications

“…With the smallest number of PEs, our design does not achieve the highest throughput or the corresponding energy efficiency, since our throughput target is to meet the required real-time constraints. Existing transformer-based designs only optimize transformer attention execution by exploiting the sparsity of attention [17]-[19], [21], rather than the whole model as in this work. In addition, our design must optimize for CNN, transformer, and GRU at the same time, which is not addressed in previous designs.…”
Section: Hardware Implementation Results
confidence: 99%
“…This design utilizes a systolic array for swift self-attention computation and extends native support for both LN and softmax operations. On another note, [21] put forth a transformer processor designed to bypass weakly related tokens, targeting enhanced energy efficiency. However, this approach introduces an irregular and intricate computing structure.…”
Section: Deep Learning Accelerators
confidence: 99%
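The token-bypassing idea described in the excerpt above can be made concrete with a small sketch. The NumPy snippet below is a software toy of skipping weakly related tokens in self-attention: key/value pairs whose attention weight falls below a threshold are dropped before the weighted sum. The function name, threshold value, and tensor shapes are illustrative assumptions; this is not the dataflow of [21] or of the ISSCC 2022 processor.

```python
import numpy as np

def sparse_attention(Q, K, V, threshold=0.02):
    """Toy self-attention that bypasses weakly related tokens.

    Key/value pairs whose softmax weight falls below `threshold` are
    skipped before the weighted sum. The threshold is an illustrative
    hyper-parameter, not a value taken from the cited papers.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k) raw attention scores
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # softmax over keys

    keep = probs >= threshold                       # speculate which tokens matter
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum(axis=-1, keepdims=True)      # renormalize surviving weights

    skipped = 1.0 - keep.mean()                     # fraction of query-key pairs bypassed
    return probs @ V, skipped

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 64)) for _ in range(3))
out, skipped = sparse_attention(Q, K, V)
print(f"output shape: {out.shape}, bypassed pairs: {skipped:.1%}")
```

In hardware, the speculation would take place before the full score computation so that the skipped multiply-accumulates are never issued; the sketch only illustrates the masking and the achievable skip ratio.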
“…In this paper, the T_si of the last four stages are set as 0, while the T_si of the remaining stages are set as 0.1. Finally, the bit-width is set as [10,10,11,12,12,13,14,15,16]. We utilize the frame-length adaptive MFCC structure, which is proposed in our previous work [27], and the architecture is shown in Fig.…”
Section: Stage-by-stage Bit-width Selection Algorithm
confidence: 99%
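As a rough illustration of what a per-stage bit-width schedule like the one quoted above means in practice, the sketch below quantizes each pipeline stage's output to its assigned word length. The helper name, the signed fixed-point format, and the placement of the binary point are assumptions for illustration; the quoted selection algorithm itself (how T_si drives the choice) is not reproduced here.

```python
import numpy as np

def quantize_stage(x, bits, frac_bits=None):
    """Quantize one stage's output to a signed fixed-point word of `bits` bits.

    Illustrative helper: the fraction split (binary point position) is an
    assumption, not a detail taken from the cited work.
    """
    if frac_bits is None:
        frac_bits = bits - 1                        # assume values roughly in [-1, 1)
    scale = 2.0 ** frac_bits
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale   # round and saturate

# Per-stage bit-width schedule from the excerpt above.
stage_bits = [10, 10, 11, 12, 12, 13, 14, 15, 16]
rng = np.random.default_rng(0)
signal = rng.uniform(-1.0, 1.0, 256)
for bits in stage_bits:
    # Stand-in for the real stage computation (e.g. one MFCC pipeline stage),
    # followed by quantization to that stage's selected bit-width.
    signal = quantize_stage(signal, bits)
print("final value range:", signal.min(), signal.max())
```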
“…Since adjacent frames share similar information, efficiently leveraging video temporal correlations to minimize the computing cost of video models is worth exploring. In ISSCC'20, Yuan [8] proposed an inter-frame data-reuse processor for video acceleration. Rather than directly inputting the original frames, the work processes the difference feature between two frames in each CNN layer to reduce the redundant computation.…”
Section: AI Chips for Image or Video Processing
confidence: 99%
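A minimal sketch may help make the inter-frame data-reuse idea concrete. Because convolution is linear, conv(frame N) = conv(frame N-1) + conv(frame N - frame N-1), and when adjacent frames are similar the difference is mostly near zero, so most of the work can be skipped. The snippet below (NumPy/SciPy) mimics this for a single 2-D convolution; the threshold, array sizes, and function names are illustrative assumptions, not details of [8].

```python
import numpy as np
from scipy.signal import convolve2d

def conv_with_frame_reuse(prev_frame, curr_frame, kernel, prev_output, eps=1e-3):
    """Toy inter-frame data reuse for one convolution layer.

    conv(curr) = conv(prev) + conv(curr - prev), so only the sparse
    difference has to be processed. `eps` is an illustrative threshold
    for treating small differences as zero.
    """
    delta = curr_frame - prev_frame
    delta[np.abs(delta) < eps] = 0.0                # sparsify the inter-frame difference
    skippable = (delta == 0).mean()                 # fraction of positions a delta engine could skip
    out = prev_output + convolve2d(delta, kernel, mode="same")
    return out, skippable

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))
frame0 = rng.standard_normal((64, 64))
frame1 = frame0 + 0.001 * rng.standard_normal((64, 64))    # nearly identical next frame

out0 = convolve2d(frame0, kernel, mode="same")              # full computation for the first frame
out1, skippable = conv_with_frame_reuse(frame0, frame1, kernel, out0)
print(f"skippable positions in the delta: {skippable:.1%}")
```

Zeroing small differences trades a small approximation error for sparsity; an actual accelerator would decide this threshold (or keep exact deltas) according to its accuracy budget.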