Effect of TTS Generated Audio on OOV Detection and Word Error Rate in ASR for Low-resource Languages

Murthy, Savitha; Sitaram, Dinkar; Sitaram, Sunayana

doi:10.21437/interspeech.2018-1555

Cited by 10 publications (9 citation statements)

References 25 publications (23 reference statements)

“…Studies that involve language model augmentation select sentences from a large external text corpus based on certain scores assigned to the sentences (12,14). There is always a question of how much to select without making the augmented language model size very large for decoding.…”

Section: Language Model Augmentation and Lattice Rescoring (mentioning)

confidence: 99%

“…There is always a question of how much to select without making the augmented language model size very large for decoding. For example, the work in (12) selects the first 50 sentences of Kannada Wikipedia that contain certain OOV words. However, in case of only 4 hours of baseline speech, every sentence in a large corpus may contain an OOV word.…”

Section: Language Model Augmentation and Lattice Rescoring (mentioning)

confidence: 99%

“…Language model augmentation and speech synthesis-based augmentation are two other data augmentation techniques that have been applied for Kannada language in (12). Language model augmentation involves enhancing the language model with text from an external text corpus, for example, Wikipedia.…”

Section: Introduction (mentioning)

confidence: 99%

“…Speech synthesis-based augmentation is a technique where the training speech data is enhanced with synthesized speech to improve the acoustic capability of the ASR. The work by Murthy et al in (12) applies language model augmentation and synthesis-based augmentation on a low resource continuous speech dataset of 4 hours in Kannada. The authors achieve an absolute improvement of 8.62% over the baseline, resulting in WER of 38.02%.…”

Section: Introduction (mentioning)

confidence: 99%

Low Resource Kannada Speech Recognition using Lattice Rescoring and Speech Synthesis

Murthy; Sitaram (2023), IJST

Self Cite

Objectives: To improve, using data augmentation, the accuracy of low-resource speech recognition for a model trained on only 4 hours of transcribed continuous speech in Kannada.

Methods: For initial decoding, the baseline language model is augmented with unigram counts of words that are present in the Wikipedia text corpus but absent from the baseline. Lattice rescoring is then applied using the language model augmented with Wikipedia text. Speech synthesis-based augmentation is employed with multi-speaker, syllable-based synthesis, using voices in Kannada and, cross-lingually, Telugu. We synthesize basic syllables, syllables with consonant conjuncts, and words that contain syllables absent from the training speech, for Kannada.

Findings: An overall word error rate (WER) of 9.04% is achieved, against a baseline WER of 40.93%. Language model augmentation and lattice rescoring give an absolute improvement of 16.68%. Applying our method of syllable-based speech synthesis on top of language model augmentation and rescoring yields a total absolute reduction of 31.89% in WER. The proposed approach to language model augmentation is memory efficient and consumes only 1/8th the memory required for decoding with the Wikipedia-augmented language model (2 gigabytes versus 18 gigabytes) while giving comparable WER (22.95% for Wikipedia versus 24.25% for our method). Augmentation with synthesized syllables enhances the ability of the speech recognition model to recognize basic sounds, thus improving recognition of out-of-vocabulary words to 90% and in-vocabulary words to 97%.

Novelty: We propose novel methods of language model augmentation and synthesis-based augmentation to achieve low WER for a speech recognition model trained on only 4 hours of continuous speech. Obtaining high recognition accuracy (or low WER) for a very small speech corpus is a challenge. In this paper, we demonstrate that high accuracy can be achieved using data augmentation for speech recognition based on a small corpus.
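The WER figures in this abstract are absolute reductions (differences in percentage points), while the RNN-T abstract listed further down reports relative reductions. The following is a minimal sketch of how the two measures relate, using only the numbers quoted above. The function names are illustrative and are not taken from either paper.

```python
def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Absolute WER reduction in percentage points."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Figures quoted in the abstract (all in percent WER).
BASELINE_WER = 40.93   # baseline model
LM_RESCORED_WER = 24.25  # after language model augmentation and lattice rescoring
FINAL_WER = 9.04       # after adding syllable-based speech synthesis

if __name__ == "__main__":
    # Absolute gain from language model augmentation and rescoring.
    print(round(absolute_reduction(BASELINE_WER, LM_RESCORED_WER), 2))
    # Total absolute gain over the baseline.
    print(round(absolute_reduction(BASELINE_WER, FINAL_WER), 2))
    # The same total gain expressed relative to the baseline.
    print(round(relative_reduction(BASELINE_WER, FINAL_WER), 1))
```

The first two values reproduce the 16.68 and 31.89 point improvements stated in the abstract. The relative figure is derived here for comparison only and is not reported by the authors.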

“…Therefore, how to get a large amount of paired speech-text data with a small cost is a practical problem for RNN-T model adaptation. The most popular method is to synthesize speech from the text of the new domain using text to speech (TTS) [19,25,26,27]. Although no real speech data is needed to be collected, it has limitations: 1) the speaker variation in TTS data is very limited especially compared with the real production data, 2) TTS data dilutes the acoustic variation contained in real speech data.…”

Section: Introduction (mentioning)

confidence: 99%

On Addressing Practical Challenges for RNN-Transducer

Zhao; Xue; et al. (2021), 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

In this paper, several methods are proposed to address practical challenges in deploying an RNN Transducer (RNN-T) based speech recognition system. These challenges are adapting a well-trained RNN-T model to a new domain without collecting audio data, and obtaining time stamps and confidence scores at the word level. The first challenge is solved with a data splicing method, which concatenates speech segments extracted from the source domain data. To get the time stamps, a phone prediction branch that shares the encoder is added to the RNN-T model for the purpose of forced alignment. Finally, we obtain word-level confidence scores by utilizing several types of features calculated during decoding and from the confusion network. Evaluated with Microsoft production data, the data splicing adaptation method improves on the baseline and on adaptation with the text-to-speech method by 58.03% and 15.25% relative word error rate reduction, respectively. The proposed time stamping method achieves a word timing difference of less than 50 ms on average while maintaining the recognition accuracy of the RNN-T model. We also obtain high confidence annotation performance at limited computation cost.

Quality Assurance for Speech Synthesis with ASR

Peinl; Wirth (2022), Lecture Notes in Networks and Systems
