Shakti P. Rath scite author profile

The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speechto-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (∼10 hours/language) with 4 languages for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A language independent acoustic model test on the target language showed that retraining or adapting of the acoustic models to the target language is currently minimally needed to achieve reasonable performance.

show abstract

Data augmentation for low resource languages

Ragni¹,

Knill²,

Rath³

et al. 2014

View full text Add to dashboard Cite

Recently there has been interest in the approaches for training speech recognition systems for languages with limited resources. Under the IARPA Babel program such resources have been provided for a range of languages to support this research area. This paper examines a particular form of approach, data augmentation, that can be applied to these situations. Data augmentation schemes aim to increase the quantity of data available to train the system, for example semi-supervised training, multilingual processing, acoustic data perturbation and speech synthesis. To date the majority of work has considered individual data augmentation schemes, with few consistent performance contrasts or examination of whether the schemes are complementary. In this work two data augmentation schemes, semisupervised training and vocal tract length perturbation, are examined and combined on the Babel limited language pack configuration. Here only about 10 hours of transcribed acoustic data are available. Two languages are examined, Assamese and Zulu, which were found to be the most challenging of the Babel languages released for the 2014 Evaluation. For both languages consistent speech recognition performance gains can be obtained using these augmentation schemes. Furthermore the impact of these performance gains on a down-stream keyword spotting task are also described.

show abstract

A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition

Jain¹,

Singh

Rath

2019

View full text Add to dashboard Cite

A major challenge in Automatic Speech Recognition(ASR) systems is to handle speech from a diverse set of accents. A model trained using a single accent performs rather poorly when confronted with different accents. One of the solutions is a multicondition model trained on all the accents. However the performance improvement in this approach might be rather limited. Otherwise, accent-specific models might be trained but they become impractical as number of accents increases. In this paper, we propose a novel acoustic model architecture based on Mixture of Experts (MoE) which works well on multiple accents without having the overhead of training separate models for separate accents. The work is based on our earlier work, termed as MixNet, where we showed performance improvement by separation of phonetic class distributions in the feature space. In this paper, we propose an architecture that helps to compensate phonetic and accent variabilities which helps in even better discrimination among the classes. These variabilities are learned in a joint framework , and produce consistent improvements over all the individual accents, amounting to an overall 18% relative improvement in accuracy compared to baseline trained in multi-condition style.

show abstract

Improved feature processing for deep neural networks

et al. 2013

View full text Add to dashboard Cite

Joint Distribution Learning in the Framework of Variational Autoencoders for Far-Field Speech Enhancement

Chelimilla

Kumar

Rath

2019

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Shakti P. Rath

Investigation of multilingual deep neural networks for spoken term detection

Data augmentation for low resource languages

A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition

Improved feature processing for deep neural networks

Joint Distribution Learning in the Framework of Variational Autoencoders for Far-Field Speech Enhancement

Contact Info

Product

Resources

About