Interspeech 2020
DOI: 10.21437/interspeech.2020-1569
Subword Regularization: An Analysis of Scalability and Generalization for End-to-End Automatic Speech Recognition


Cited by 7 publications (6 citation statements). References: 0 publications.
“…To analyze OOV word recognition performance, we used an F-score metric similar to [22]. The method is based on counting, after decoding, how many times the model emitted (true positives) or did not emit (false negatives) the OOV words from the evaluation set.…”
Section: Methods Description
confidence: 99%
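As a rough illustration of such a metric, the following Python sketch computes an OOV F-score by counting per-utterance emissions. It is a minimal sketch under stated assumptions (whitespace-tokenized transcripts; the oov_fscore helper is hypothetical), not the cited paper's exact procedure; it additionally counts spurious emissions as false positives so that precision is well defined.

    from collections import Counter

    def oov_fscore(references, hypotheses, oov_words):
        """OOV recognition F-score: an OOV occurrence in a reference counts
        as a true positive if the hypothesis emits it and a false negative
        otherwise; extra emissions count as false positives."""
        tp = fp = fn = 0
        for ref, hyp in zip(references, hypotheses):
            ref_counts = Counter(w for w in ref.split() if w in oov_words)
            hyp_counts = Counter(w for w in hyp.split() if w in oov_words)
            for w in oov_words:
                tp += min(ref_counts[w], hyp_counts[w])
                fn += max(ref_counts[w] - hyp_counts[w], 0)
                fp += max(hyp_counts[w] - ref_counts[w], 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)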
“…There are a few previous works on ASR investigating subword augmentation by non-deterministic segmentation. Vanilla subword regularization was studied in [21, 22]. In the first work, the method was applied to the WSJ dataset (English, 50 h).…”
Section: Introduction
confidence: 99%
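For reference, vanilla subword regularization of this kind can be reproduced with the SentencePiece library, whose unigram models support on-the-fly segmentation sampling. A minimal sketch (the model path is a placeholder):

    import sentencepiece as spm

    # Load a trained SentencePiece unigram model (placeholder path).
    sp = spm.SentencePieceProcessor(model_file="unigram.model")

    text = "subword regularization"

    # Deterministic best segmentation, as used at inference time.
    print(sp.encode(text, out_type=str))

    # Sampled segmentations for training-time augmentation: each call draws
    # a different decomposition; alpha smooths the sampling distribution and
    # nbest_size=-1 samples from the full hypothesis lattice.
    for _ in range(3):
        print(sp.encode(text, out_type=str, enable_sampling=True,
                        alpha=0.1, nbest_size=-1))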
“…The method helped prevent over-fitted and over-confident models, and it could distinguish plausible target words from incorrect ones. Subwords are the most widely used output units in E2E ASR systems [84]. The researchers in [81] tested subword regularization with both CTC-based and attention-based ASR models.…”
Section: Vocabulary
confidence: 99%
“…They also showed that uniform greedy sampling of subword units, which is much faster than LSD, was an effective decomposition strategy when combined with an n-gram loss. In [84], the researchers investigated the regularizing influence of subword segmentation sampling on a streaming E2E ASR task. They evaluated how the contribution of subword regularization depended on the training dataset size, and the results suggested that subword regularization provided a consistent reduction in WER.…”
Section: Vocabulary
confidence: 99%
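The exact "uniform greedy sampling" procedure of the cited work is not detailed in this excerpt. One plausible approximation, sketched below under that assumption, is to draw uniformly from the n-best segmentations rather than sampling them in proportion to their model probability:

    import random
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="unigram.model")  # placeholder path

    def uniform_segmentation(text, nbest_size=8):
        # Uniform choice among the n-best decompositions, rather than
        # probability-proportional sampling.
        candidates = sp.nbest_encode_as_pieces(text, nbest_size)
        return random.choice(candidates)

    print(uniform_segmentation("subword regularization"))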
“…In addition, many works try to leverage multiple modeling units to jointly optimize E2E ASR. Lakomkin et al. [17] point out that combining several segmentations of an utterance transcription in the loss function may be beneficial to the E2E ASR model. Krishna et al. [18] propose joint learning with phoneme and word-piece CTC losses on top of a BiLSTM model.…”
Section: Introduction
confidence: 99%
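A common way to realize such joint optimization is to interpolate two CTC losses computed over a shared encoder. The following PyTorch sketch uses hypothetical dimensions and an equal interpolation weight; it illustrates the general recipe, not the exact model of [18]:

    import torch
    import torch.nn as nn

    class JointCTCModel(nn.Module):
        """Shared BiLSTM encoder with phoneme and word-piece CTC heads."""

        def __init__(self, feat_dim=80, hidden=256, n_phonemes=50,
                     n_wordpieces=1000):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                                   bidirectional=True, batch_first=True)
            self.phone_head = nn.Linear(2 * hidden, n_phonemes + 1)  # +1 blank
            self.wp_head = nn.Linear(2 * hidden, n_wordpieces + 1)
            self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

        def forward(self, feats, feat_lens, phones, phone_lens,
                    wordpieces, wp_lens, weight=0.5):
            enc, _ = self.encoder(feats)  # (B, T, 2 * hidden)
            # CTCLoss expects (T, B, C) log-probabilities.
            phone_logp = self.phone_head(enc).log_softmax(-1).transpose(0, 1)
            wp_logp = self.wp_head(enc).log_softmax(-1).transpose(0, 1)
            loss_phone = self.ctc(phone_logp, phones, feat_lens, phone_lens)
            loss_wp = self.ctc(wp_logp, wordpieces, feat_lens, wp_lens)
            return weight * loss_phone + (1 - weight) * loss_wp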