Interspeech 2018
DOI: 10.21437/interspeech.2018-1555

Effect of TTS Generated Audio on OOV Detection and Word Error Rate in ASR for Low-resource Languages

Abstract: Out-of-Vocabulary (OOV) detection and recovery is an important aspect of reducing Word Error Rate (WER) in Automatic Speech Recognition (ASR). In this paper, we evaluate the effect of OOV detection and recovery on WER for a low-resource language ASR system. We use a small seed corpus of continuous speech and improve the vocabulary by incorporating the detected OOV words. We use a syllable-model to detect and learn OOV words, and we augment the word-model with these words, leading to improved recognition. Our re…

Cited by 10 publications (9 citation statements)
References 25 publications (23 reference statements)
“…Studies that involve language model augmentation select sentences from a large external text corpus based on certain scores assigned to the sentences (12,14). There is always a question of how much to select without making the augmented language model size very large for decoding.…”
Section: Language Model Augmentation and Lattice Rescoring (mentioning)
confidence: 99%
“…There is always a question of how much to select without making the augmented language model size very large for decoding. For example, the work in (12) selects the first 50 sentences of Kannada Wikipedia that contain certain OOV words. However, in case of only 4 hours of baseline speech, every sentence in a large corpus may contain an OOV word.…”
Section: Language Model Augmentation and Lattice Rescoring (mentioning)
confidence: 99%
“…Therefore, how to get a large amount of paired speech-text data with a small cost is a practical problem for RNN-T model adaptation. The most popular method is to synthesize speech from the text of the new domain using text to speech (TTS) [19,25,26,27]. Although no real speech data is needed to be collected, it has limitations: 1) the speaker variation in TTS data is very limited especially compared with the real production data, 2) TTS data dilutes the acoustic variation contained in real speech data.…”
Section: Introduction (mentioning)
confidence: 99%