2022
DOI: 10.48550/arxiv.2207.11697
Preprint

Improving Mandarin Speech Recognition with Block-augmented Transformer

Abstract: Recently Convolution-augmented Transformer (Conformer)[1] has shown promising results in Automatic Speech Recognition (ASR), outperforming the previous best published Transformer Transducer[2]. In this work, we believe that the output information of each block in the encoder and decoder is not completely inclusive, in other words, their output information may be complementary. We study how to take advantage of the complementary information of each block in a parameter-efficient way, and it is expected that thi…

Cited by 3 publications (3 citation statements)
References 26 publications
“…Two methods of block integration were proposed by Xiaoming Ren et al. [11]. One approach involves weighted averaging, while the other employs SE modules for block integration. Their research demonstrates that the use of SE modules can significantly improve the accuracy of speech recognition models.…”
Section: Convolution-enhanced Channel Attention Block
confidence: 99%
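The weighted-averaging variant mentioned in the citation above can be illustrated with a minimal PyTorch sketch: each block's output is fused through a learned, softmax-normalized scalar weight. The module name, shapes, and the softmax normalization are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of weighted averaging over block outputs (assumption:
# all block outputs share the shape (batch, time, dim)).
import torch
import torch.nn as nn


class BlockWeightedSum(nn.Module):
    def __init__(self, num_blocks: int):
        super().__init__()
        # One learnable scalar per block; normalized with softmax in forward().
        self.block_logits = nn.Parameter(torch.zeros(num_blocks))

    def forward(self, block_outputs: list[torch.Tensor]) -> torch.Tensor:
        # block_outputs: list of (batch, time, dim) tensors, one per block.
        stacked = torch.stack(block_outputs, dim=0)            # (N, B, T, D)
        weights = torch.softmax(self.block_logits, dim=0)      # (N,)
        return torch.einsum("n,nbtd->btd", weights, stacked)   # weighted sum over blocks
```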
“…Blockformer [11] improves upon Conformer by adopting the SE block presented in SENet [12]. It first squeezes the output of every block through global average pooling to obtain a representative value per block, then passes these values through two nonlinear layers so that they better characterize the corresponding blocks. The processed values are used as weights and multiplied by the output of the corresponding block, and finally the weighted outputs of all blocks are summed to obtain the final result.…”
Section: Introduction
confidence: 99%
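The SE-style block integration described in the passage above can be sketched roughly as follows: squeeze each block's output with global average pooling, pass the pooled statistics through two nonlinear layers, and use the resulting weights to rescale and sum the block outputs. The layer sizes, activations, and the `reduction` parameter are assumptions and may differ from the actual Blockformer implementation.

```python
# Rough sketch of SE-style block fusion (squeeze -> two nonlinear layers ->
# reweight -> sum). Shapes assume block outputs of (batch, time, dim).
import torch
import torch.nn as nn


class SEBlockFusion(nn.Module):
    def __init__(self, num_blocks: int, reduction: int = 2):
        super().__init__()
        hidden = max(num_blocks // reduction, 1)
        # Two nonlinear layers operating on the per-block pooled statistics.
        self.excite = nn.Sequential(
            nn.Linear(num_blocks, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_blocks),
            nn.Sigmoid(),
        )

    def forward(self, block_outputs: list[torch.Tensor]) -> torch.Tensor:
        # block_outputs: list of (batch, time, dim) tensors, one per block.
        stacked = torch.stack(block_outputs, dim=1)      # (B, N, T, D)
        squeezed = stacked.mean(dim=(2, 3))              # (B, N): global average pooling per block
        weights = self.excite(squeezed)                  # (B, N): per-block weights
        weighted = stacked * weights[:, :, None, None]   # scale each block's output
        return weighted.sum(dim=1)                       # (B, T, D): fused representation
```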
“…Therefore, whether in streaming or non-streaming speech recognition, end-to-end speech recognition models based on the Transformer have achieved very good results [6,7]. The self-attention mechanism in the Transformer model mimics the way humans attend to important information. Various improvement approaches based on the Transformer model have become a hot topic in this research field.…”
Section: Introduction
confidence: 99%