This work tackles the problem of learning a set of language-specific acoustic units from unlabeled speech recordings, given labeled recordings in other languages. Our approach follows a two-step procedure: the model first learns the notion of an acoustic unit from the labeled data, then uses that knowledge to discover new acoustic units in the target language. We implement this process with the Bayesian Subspace Hidden Markov Model (SHMM), a model akin to the Subspace Gaussian Mixture Model (SGMM) in which each low-dimensional embedding represents a full acoustic unit rather than a single HMM state. The subspace is trained on three languages from the GlobalPhone corpus (German, Polish and Spanish), and the acoustic units are discovered on the TIMIT corpus. Results, measured in equivalent Phone Error Rate, show that this approach significantly outperforms previous HMM-based acoustic unit discovery systems and compares favorably with the Variational AutoEncoder-HMM.
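The core subspace idea can be sketched as follows: each acoustic unit is a low-dimensional embedding, and per-state projection matrices learned on the labeled source languages map that embedding to full HMM emission parameters. This is a minimal numpy sketch of the mapping only, with random matrices standing in for the learned subspace; the dimensions and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

D, Q = 39, 10   # feature dim (e.g. MFCC) and subspace dim -- hypothetical sizes
S = 3           # HMM states per acoustic unit

# Per-state subspace bases; in the SHMM these would be learned on the
# labeled source languages, here they are just random placeholders.
W = [rng.normal(size=(D, Q)) for _ in range(S)]
b = [rng.normal(size=D) for _ in range(S)]

def unit_means(h):
    """Map a low-dimensional unit embedding h to the D-dim Gaussian
    means of its S HMM states, SGMM/SHMM-style: mean_s = W_s @ h + b_s."""
    return np.stack([W[s] @ h + b[s] for s in range(S)])

h_new = rng.normal(size=Q)      # embedding of a newly discovered unit
means = unit_means(h_new)       # shape (S, D): one mean per HMM state
```

Discovering a unit in the target language then amounts to inferring only the Q-dimensional embedding `h_new`, while the subspace itself stays fixed.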
Abstract: Recent developments in deep learning have greatly influenced the performance of speech recognition systems. In a Hidden Markov Model-Deep Neural Network (HMM-DNN) speech recognition system, DNNs model senones (context-dependent HMM states), while HMMs capture the temporal relations among senones. Deeper networks have yielded significant performance improvements, so deep learning methods for training very deep architectures have attracted considerable scientific interest. Optimizing a deeper network is a more complex task than optimizing a shallower one, but residual networks have recently exhibited the ability to train very deep architectures without being prone to vanishing/exploding gradient problems. In this work, the effectiveness of residual networks is explored for speech recognition. Along with the depth of the residual network, the criticality of its width is also studied; it is observed that at higher depths, width is also a crucial parameter for attaining significant improvements. A 14-hour subset of the WSJ corpus is used to train the speech recognition systems, and the residual networks converge easily even at depths much greater than that of the baseline deep neural network. Using residual networks, an absolute WER reduction of 0.4 (an 8% relative reduction) is attained over the best-performing deep neural network.
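The mechanism that makes depth tractable is the identity skip connection: each block computes y = x + F(x), so gradients can flow through the additive path unattenuated. This is a minimal numpy sketch of stacking such blocks, with illustrative widths and random weights rather than the paper's acoustic-model architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(x + F(x)) with F(x) = W2 @ relu(W1 @ x).
    The identity term x is what keeps gradient flow healthy at depth."""
    return relu(x + W2 @ relu(W1 @ x))

rng = np.random.default_rng(0)
width = 64                       # hypothetical layer width
x = rng.normal(size=width)

y = x
for _ in range(20):              # stacking many blocks stays numerically stable
    W1 = rng.normal(size=(width, width)) * 0.01
    W2 = rng.normal(size=(width, width)) * 0.01
    y = residual_block(y, W1, W2)
```

With the small residual branch, the output after 20 blocks remains close to the input signal instead of vanishing or exploding, which is the property the abstract attributes to residual training.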
End-to-end and cascade (ASR-MT) spoken language translation (SLT) systems are reaching comparable performance; however, a large degradation is observed when translating ASR hypotheses compared to using oracle input text. In this work, this degradation is reduced by creating an end-to-end differentiable pipeline between the ASR and MT systems. We train the SLT system with the ASR objective as an auxiliary loss, connecting the two networks through their neural hidden representations. This training provides an end-to-end differentiable path with respect to the final objective function and exploits the ASR objective for better optimization. The architecture improves the BLEU score from 41.21 to 44.69. Ensembling the proposed architecture with independently trained ASR and MT systems further improves the BLEU score from 44.69 to 46.9. All experiments are reported on the English-Portuguese speech translation task using the How2 corpus. The final BLEU score is on par with the best speech translation system on the How2 dataset, without using any additional training data or a language model, and with fewer parameters.
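The auxiliary-loss training described above can be reduced to a one-line objective: the translation loss plus a weighted ASR loss, with both terms differentiable through the shared hidden representations. The sketch below shows only that combination; the weight value and function names are illustrative assumptions, not hyperparameters from the paper.

```python
def joint_loss(mt_loss, asr_loss, aux_weight=0.5):
    """Final training objective: the MT loss plus the ASR loss as an
    auxiliary term. aux_weight is a hypothetical hyperparameter; in a
    real system both losses would be differentiable tensors so that
    gradients flow end-to-end through the ASR encoder."""
    return mt_loss + aux_weight * asr_loss

total = joint_loss(2.0, 1.0)   # 2.0 + 0.5 * 1.0 = 2.5
```

Because the MT decoder consumes the ASR encoder's hidden states rather than a discrete transcript, minimizing this joint objective updates both networks at once, which is what distinguishes the approach from a standard cascade.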