2021
DOI: 10.48550/arxiv.2111.10770
Preprint

Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism

Abstract: There has been a rapid advance of custom hardware (HW) for accelerating the inference speed of deep neural networks (DNNs). Previously, the softmax layer was not a main concern of DNN-accelerating HW, because its share of the computation is relatively small in multi-layer perceptrons or convolutional neural networks. However, as attention mechanisms are widely used in various modern DNNs, a cost-efficient implementation of the softmax layer is becoming very important. In this paper, we propose two methods to approximate softmax…
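
For context, the operation being approximated is the standard numerically stable softmax; the exponential and the final division are the parts that are costly in accelerator hardware. The sketch below is a reference implementation for orientation only and does not reproduce the paper's two approximation methods.

```python
import numpy as np

def softmax_reference(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis.

    The exp() and the final division are the operations that hardware
    approximations (e.g., lookup tables) aim to simplify.
    """
    z = x - np.max(x, axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)
```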

Cited by 3 publications (5 citation statements) | References 28 publications

Citation statements

“…The GAM-MLP model concludes with a fully connected layer followed by a Soft-Max layer for classification purposes [32]. The diagram illustrating the classification and recognition process in the MLP layer plus SoftMax layer is shown in Figure 7 below.…”
Section: Activity Classification and Recognition (mentioning)
confidence: 99%
“…• Approximate attention: This method uses a low-rank matrix or a random feature map to approximate the encoder output sequence, thereby reducing the amount of computation and memory consumption, while maintaining a certain degree of accuracy and effect (Vasyltsov and Chang, 2021). The formula for approximate attention is as follows:…”
Section: Attention Mechanism (mentioning)
confidence: 99%
“…In the work [28], a precision-adjustable architecture for the Softmax function was developed, with all inputs and outputs represented in 16-bit format, achieving both efficiency and adjustability. Furthermore, [27] even explored the use of 8-bit quantization for Softmax function computations, achieving minimal precision loss while working with attention mechanisms in deep neural networks.…”
Section: Quantification Methods (mentioning)
confidence: 99%
“…Exponential and division computations require substantial computational resources and time, potentially leading to increased hardware resource consumption and computation latency. Some researchers have made contributions in this regard: [27] proposed two methods using 8-bit fixed-point approximations based on lookup tables to compute Softmax, achieving an accuracy loss of less than 1%. In [28], a method utilizing lookup tables to implement a Precision-Adjustable approach for the Softmax function was employed.…”
Section: B. Approximate Calculation of Nonlinear Function (mentioning)
confidence: 99%
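
To make the lookup-table idea concrete, here is a minimal sketch of a table-based softmax in which exp() is replaced by a small precomputed table indexed by the max-subtracted input. The table size, input range, and the final floating-point division are assumptions for illustration, not the exact fixed-point designs of [27] or [28].

```python
import numpy as np

EXP_LUT_BITS = 8                      # 256-entry table (assumed size)
EXP_LUT_RANGE = 8.0                   # cover z in [0, 8]; exp(-8) is ~3e-4
_idx = np.arange(2 ** EXP_LUT_BITS)
_EXP_LUT = np.exp(-EXP_LUT_RANGE * _idx / (2 ** EXP_LUT_BITS - 1))  # samples of exp(-z)

def lut_softmax(x: np.ndarray) -> np.ndarray:
    """Softmax where exp() is replaced by a table lookup."""
    z = np.max(x, axis=-1, keepdims=True) - x          # z >= 0, we need exp(-z)
    idx = np.round(np.minimum(z, EXP_LUT_RANGE)
                   / EXP_LUT_RANGE * (2 ** EXP_LUT_BITS - 1)).astype(int)
    e = _EXP_LUT[idx]                                   # table lookup replaces exp()
    return e / np.sum(e, axis=-1, keepdims=True)
```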