2019
DOI: 10.48550/arxiv.1912.03010
Preprint
Semantic Mask for Transformer based End-to-End Speech Recognition

Abstract: Attention-based encoder-decoder models have achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. This approach takes advantage of the memorization capacity of neural networks to learn the mapping from the input sequence to the output sequence from scratch, without assuming prior knowledge such as alignments. However, this model is prone to overfitting, especially when the amount of training data is limited. Inspired by SpecAugment and BERT, in this p…
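The abstract describes a semantic (word-level) masking scheme in the spirit of SpecAugment and BERT: instead of masking arbitrary time bands, whole words are masked in the acoustic features so the decoder must predict them from context. A minimal sketch of that idea, assuming word-level frame alignments are available; the function name, parameters, and mean-fill choice are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def semantic_mask(features, word_spans, mask_prob=0.15, rng=None):
    """Mask the acoustic frames of randomly chosen words.

    features:   (T, F) array of acoustic features (e.g. filter banks).
    word_spans: list of (start_frame, end_frame) alignments, one per word.
    mask_prob:  probability of masking each word independently.

    Masked frames are replaced with the utterance-level mean, so the
    model must rely on surrounding context (a language-model-like cue)
    to recover the masked word.
    """
    rng = rng or np.random.default_rng()
    out = features.copy()
    fill = features.mean(axis=0)  # per-dimension mean over the utterance
    for start, end in word_spans:
        if rng.random() < mask_prob:
            out[start:end] = fill
    return out
```

The key difference from SpecAugment's time masking is that the masked span here covers a linguistically meaningful unit (a word) rather than a random stretch of frames.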


Cited by 26 publications (30 citation statements)
References 14 publications
“…Our large baseline model gives better test-clean and test-other WER results than the reported results in the ESPNET Github repository 2 . Also, our large baseline model outperforms a very recent work with semantic masking technique for transformer ASR [34]. With LM fusion, our proposed fine-tuning approach gives the best results for both small and large settings.…”
Section: Results (mentioning)
confidence: 74%
“…[9] focuses on multi-task learning with CTC, and shows that transformer-based end-to-end ASR is highly competitive with RNN-based methods. In [34], a semantic mask based regularization method was introduced for transformer-based ASR to force the decoder to learn a better language model. A hybrid transformer model with deep layers and iterated loss was introduced in [31].…”
Section: Relation To Prior Work (mentioning)
confidence: 99%
“…Meanwhile, another 5% WER reduction is observed by better initialization (LFA→LASR). We can also see that, even though the separation model is tuned with ASR_matched, the consistent performance improvement still exists when the ASR model is changed to ASR [38] which uses a slightly different feature and architecture. Finally, compared to the state-of-the-art system reported in [17], our model shows a significant performance improvement, reducing the average WER from 9.1% to 5.7%.…”
Section: Architecture Comparison (mentioning)
confidence: 81%
“…This model shows WERs of 2.80% and 6.80% on Librispeech test-clean and test-other, respectively. The second ASR is the one developed in [38], which we call ASR [38]. A concatenation of the filter bank and pitch features is used as the input to this model, and it achieves WERs of 2.08% and 4.95% on test-clean and test-other, respectively.…”
Section: Asr Model (mentioning)
confidence: 99%
“…With the emergence of E2E ASR, researchers [7][8][9][10] explore different E2E ASR scenarios and partly focus on data augmentation and training strategy, since these models are data-hungry and prone to over-fitting. For Uyghur speech recognition, existing works mainly focused on problems such as agglutinative characteristics and low resources, often in conventional structure [11][12][13].…”
Section: Introduction (mentioning)
confidence: 99%