2022
DOI: 10.48550/arxiv.2207.11697
Preprint

Improving Mandarin Speech Recognition with Block-augmented Transformer

Abstract: Recently Convolution-augmented Transformer (Conformer)[1] has shown promising results in Automatic Speech Recognition (ASR), outperforming the previous best published Transformer Transducer[2]. In this work, we believe that the output information of each block in the encoder and decoder is not completely inclusive, in other words, their output information may be complementary. We study how to take advantage of the complementary information of each block in a parameter-efficient way, and it is expected that thi…

Cited by 3 publications (3 citation statements)
References 26 publications
“…Two methods of block integration were proposed by Xiaoming Ren et al. [11]. One approach involves weighted averaging, while the other employs SE modules for block integration. Their research demonstrates that the use of SE modules can significantly improve the accuracy of speech recognition models.…”
Section: Convolution-enhanced Channel Attention Block
confidence: 99%
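The weighted-averaging variant mentioned in the citation above can be illustrated with a minimal PyTorch sketch: each block's output is fused through a learned, softmax-normalized scalar weight. The module name, shapes, and the softmax normalization are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of weighted averaging over block outputs (assumption:
# all block outputs share the shape (batch, time, dim)).
import torch
import torch.nn as nn


class BlockWeightedSum(nn.Module):
    def __init__(self, num_blocks: int):
        super().__init__()
        # One learnable scalar per block; normalized with softmax in forward().
        self.block_logits = nn.Parameter(torch.zeros(num_blocks))

    def forward(self, block_outputs: list[torch.Tensor]) -> torch.Tensor:
        # block_outputs: list of (batch, time, dim) tensors, one per block.
        stacked = torch.stack(block_outputs, dim=0)            # (N, B, T, D)
        weights = torch.softmax(self.block_logits, dim=0)      # (N,)
        return torch.einsum("n,nbtd->btd", weights, stacked)   # weighted sum over blocks
```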
“…Blockformer [11] improves upon Conformer by adopting the SE block presented in SENet [12]. It first squeezes the output of every block through global average pooling to obtain a representative value per block, then passes these values through two nonlinear layers so that they better characterize the corresponding blocks. The processed values are used as weights and multiplied by the output of the corresponding block, and finally the weighted outputs of all blocks are summed to obtain the final result.…”
Section: Introduction
confidence: 99%
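The SE-style block integration described in the passage above can be sketched roughly as follows: squeeze each block's output with global average pooling, pass the pooled statistics through two nonlinear layers, and use the resulting weights to rescale and sum the block outputs. The layer sizes, activations, and the `reduction` parameter are assumptions and may differ from the actual Blockformer implementation.

```python
# Rough sketch of SE-style block fusion (squeeze -> two nonlinear layers ->
# reweight -> sum). Shapes assume block outputs of (batch, time, dim).
import torch
import torch.nn as nn


class SEBlockFusion(nn.Module):
    def __init__(self, num_blocks: int, reduction: int = 2):
        super().__init__()
        hidden = max(num_blocks // reduction, 1)
        # Two nonlinear layers operating on the per-block pooled statistics.
        self.excite = nn.Sequential(
            nn.Linear(num_blocks, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_blocks),
            nn.Sigmoid(),
        )

    def forward(self, block_outputs: list[torch.Tensor]) -> torch.Tensor:
        # block_outputs: list of (batch, time, dim) tensors, one per block.
        stacked = torch.stack(block_outputs, dim=1)      # (B, N, T, D)
        squeezed = stacked.mean(dim=(2, 3))              # (B, N): global average pooling per block
        weights = self.excite(squeezed)                  # (B, N): per-block weights
        weighted = stacked * weights[:, :, None, None]   # scale each block's output
        return weighted.sum(dim=1)                       # (B, T, D): fused representation
```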
“…Therefore, whether in streaming or non-streaming speech recognition, end-to-end speech recognition models based on the Transformer have achieved very good results [6,7]. The self-attention mechanism in the Transformer model mimics the way humans attend to important information. Various improvement approaches based on the Transformer model have become a hot topic in this research field.…”
Section: Introduction
confidence: 99%