An Exploration towards Joint Acoustic Modeling for Indian Languages: IIIT-H Submission for Low Resource Speech Recognition Challenge for Indian Languages, INTERSPEECH 2018

Vydana, Hari Krishna; Gurugubelli, Krishna; Vegesna, Vishnu Vidyadhara Raju; Vuppala, Anil Kumar

doi:10.21437/interspeech.2018-1584

Cited by 17 publications

(5 citation statements)

References 26 publications


“…The data sets are different. [52], [53], [49], and [55] used isolated Gujarati words, [54] used 25-word sentences, [56] did not limit to number of words in the sentences, [51], [57] used continuous speech of three Indian languages, and [50] used continuous speech of 9 Indian languages. The table highlights the accuracy achieved with Gujarati language.…”

Section: Mathematical Evaluation Of Results (mentioning)

confidence: 99%


G-Cocktail: An Algorithm to Address Cocktail Party Problem of Gujarati Language Using Cat Boost

Gupta, Singh, Singh

2022

Wireless Pers Commun


The COVID-19 pandemic has moved many activities online. People tired of typing prefer to give voice commands, yet most voice-based applications and devices are not prepared to handle native languages. Moreover, in a party environment it is difficult to identify a voice command because there are many speakers. The proposed work addresses the cocktail party problem for the Indian language Gujarati. Voice response systems such as Siri, Alexa, and Google Assistant currently work on a single voice command. The proposed algorithm, G-Cocktail, would help these applications identify a command given in Gujarati even from a mixed voice signal. The benchmark dataset is taken from Microsoft and the Linguistic Data Consortium for Indian Languages (LDC-IL) and comprises single words and phrases. G-Cocktail uses the CatBoost algorithm to classify and identify the voice. Voice prints of all the sound files are created using pitch and Mel Frequency Cepstral Coefficients (MFCC). Seventy percent of the voice prints are used to train the network and thirty percent for testing. The proposed work is tested and compared with K-means, Naïve Bayes, and LightGBM.


“…There is no historical evidence of Cocktail-party scene with Gujarati language [47][48][49][50][51][52][53][54][55][56][57]. For ASR in Gujarati, methods like Statistical, Neural Networks and End-to-end recognition are used [35].…”

Section: Introduction (mentioning)

confidence: 99%

G-Cocktail: An Algorithm to Address Cocktail Party Problem of Gujarati Language Using Cat Boost

Gupta, Singh, Singh

2022

Wireless Pers Commun


“…Comparison of word error rates % (WER) for different approaches and different multi-stream approaches evaluated on target-language (Mandarin) test set. The lattice-combination approach is described in [12], Feature-combination approach is used in [14,34]. Considering all the results above, all the three feature fusion approaches perform better than the baseline (see Table 6).…”

Section: Combination Layer Configuration WER [%] (mentioning)

confidence: 90%

“…In those experiments, both stops and nasals attributes were correctly detected, which can prove that the speech attribute can be used in cross-lingual speech recognition in English and Mandarin. There are few studies on multilingual speech recognition integrating AFs; Hari Krishna et al, trained a bank of AFs detectors using source language to predict the articulatory features for the target languages, which showed that the combination of AFs using AF-Tandem method performs better than the lattice-rescoring approach [14].…”

Section: Related Work (mentioning)

confidence: 99%

Domain-Adversarial Based Model with Phonological Knowledge for Cross-Lingual Speech Recognition

Zhan, Xie, et al.

2021

Electronics


Phonological-based features (articulatory features, AFs) describe the movements of the vocal organs, which are shared across languages. This paper investigates a domain-adversarial neural network (DANN) to extract reliable AFs, and different multi-stream techniques are used for cross-lingual speech recognition. First, a novel universal phonological attributes definition is proposed for Mandarin, English, German and French. Then a DANN-based AFs detector is trained using source languages (English, German and French). For cross-lingual speech recognition, the AFs detectors are used to transfer the phonological knowledge from the source languages (English, German and French) to the target language (Mandarin). Two multi-stream approaches are introduced to fuse the acoustic features and cross-lingual AFs. In addition, the monolingual AFs system (i.e., the AFs are directly extracted from the target language) is also investigated. Experiments show that the performance of the AFs detector can be improved by using convolutional neural networks (CNN) with a domain-adversarial learning method. The multi-head attention (MHA) based multi-stream approach reaches the best performance compared to the baseline, the cross-lingual adaptation approach, and other approaches. More specifically, the MHA-mode with cross-lingual AFs yields significant improvements over monolingual AFs when the training data size is restricted, and it can be easily extended to other low-resource languages.


“…The 'Low Resource Speech Recognition Challenge for Indian Languages -Interspeech 2018' included 40 hours of speech corpora in Telugu, Tamil and Gujarati languages. Multilingual training was adapted wherein the acoustic model was trained in all three languages leading to an improvement of approximately 5-8% in WER [35,36,37,38,39]. However, these methods above reduce recognition errors in words already present in the ASR's lexicon.…”

Section: Data Augmentation In ASR (mentioning)

confidence: 99%

Low Resource Kannada Speech Recognition using Lattice Rescoring and Speech Synthesis

Murthy, Sitaram

2023

IJST


Objectives: Improving the accuracy of low resource speech recognition in a model trained on only 4 hours of transcribed continuous speech in the Kannada language, using data augmentation. Methods: The baseline language model is augmented with unigram counts of words that are present in the Wikipedia text corpus but absent in the baseline, for initial decoding. Lattice rescoring is then applied using the language model augmented with Wikipedia text. Speech synthesis-based augmentation with multi-speaker syllable-based synthesis, using voices in Kannada and cross-lingual Telugu languages, is employed. We synthesize basic syllables, syllables with consonant conjuncts, and words that contain syllables that are absent in the training speech, for the Kannada language. Findings: An overall word error rate (WER) of 9.04% is achieved over a baseline WER of 40.93%. Language model augmentation and lattice rescoring give an absolute improvement of 16.68%. Applying our method of syllable-based speech synthesis over language model augmentation and rescoring yields a total reduction of 31.89% in WER. The proposed approach of language model augmentation is memory efficient and consumes only 1/8th the memory required for decoding with the Wikipedia augmented language model (2 gigabytes versus 18 gigabytes) while giving comparable WER (22.95% for Wikipedia versus 24.25% for our method). Augmentation with synthesized syllables enhances the ability of the speech recognition model to recognize basic sounds, thus improving recognition of out-of-vocabulary words to 90% and in-vocabulary words to 97%. Novelty: We propose novel methods of language model augmentation and synthesis-based augmentation to achieve low WER for a speech recognition model trained on only 4 hours of continuous speech. Obtaining high recognition accuracy (or low WER) for a very small speech corpus is a challenge. In this paper, we demonstrate that high accuracy can be achieved using data augmentation for small corpus-based speech recognition.
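The WER figures quoted in the abstract above can be cross-checked with simple arithmetic. The sketch below is only an illustration: the helper names `absolute_reduction` and `relative_reduction` are hypothetical and do not come from the cited paper, and the relative reduction is a derived quantity that the abstract does not itself report.

```python
# Illustrative check of the WER figures quoted in the abstract.
# Function and variable names are hypothetical, not from the cited paper.

def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Absolute WER reduction in percentage points, rounded to 2 decimals."""
    return round(baseline_wer - new_wer, 2)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction as a percentage of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer * 100


BASELINE_WER = 40.93   # baseline WER (%) quoted in the abstract
AFTER_LM_WER = 24.25   # WER (%) after LM augmentation and lattice rescoring
FINAL_WER = 9.04       # overall WER (%) after adding synthesis-based augmentation

lm_gain = absolute_reduction(BASELINE_WER, AFTER_LM_WER)    # quoted as 16.68
total_gain = absolute_reduction(BASELINE_WER, FINAL_WER)    # quoted as 31.89
total_relative = relative_reduction(BASELINE_WER, FINAL_WER)

print(f"LM augmentation + rescoring: {lm_gain} points absolute")
print(f"All methods combined: {total_gain} points absolute")
print(f"Relative to baseline (derived, not quoted): {total_relative:.1f}%")
```

Read this way, the 16.68% and 31.89% figures are absolute differences in percentage points, which is consistent with the 24.25% intermediate WER mentioned in the same abstract.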

