An Automatic Portable Near Surface Soil Characterization System

In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Data reading, large-scale data augmentation, and neural network parameter updates are all performed "on-the-fly". We use vocal tract length perturbation [1] and an acoustic simulator [2] for data augmentation. The processed features and labels are sent to the GPU cluster. The Horovod allreduce approach is employed to train neural network parameters. We evaluated the effectiveness of our system on the standard LibriSpeech corpus [3] and the 10,000-hr anonymized Bixby English dataset. Our end-to-end speech recognition system built using this training infrastructure achieved a 2.44% WER on the test-clean subset of LibriSpeech after applying shallow fusion with a Transformer language model (LM). For the proprietary English Bixby open domain test set, we obtained a WER of 7.92% using a Bidirectional Full Attention (BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the monotonic chunkwise attention (MoCha) based approach is employed for streaming speech recognition, we obtained a WER of 9.95% on the same Bixby open domain test set.
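The abstract reports results "after applying shallow fusion" with a language model. As a rough illustration only, shallow fusion is commonly described as adding a weighted LM log-probability to the recognizer's log-probability. The sketch below is not the authors' implementation: the function names, the weight of 0.3, and the scores are all hypothetical, and for brevity it applies the combined score to complete hypotheses rather than at each beam-search step.

```python
from typing import List, Tuple

# A hypothesis is (text, ASR log-probability, LM log-probability).
# All values used below are made-up illustrative numbers.
Hypothesis = Tuple[str, float, float]


def shallow_fusion_score(asr_logp: float, lm_logp: float, lm_weight: float) -> float:
    """Combine the two log-probabilities with a fixed LM weight."""
    return asr_logp + lm_weight * lm_logp


def pick_best(hypotheses: List[Hypothesis], lm_weight: float = 0.3) -> Hypothesis:
    """Return the hypothesis with the highest fused score."""
    return max(
        hypotheses,
        key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight),
    )


hyps = [
    ("recognize speech", -4.0, -2.0),
    ("wreck a nice beach", -3.8, -6.0),
]

# With the LM contribution, the more plausible word sequence wins.
best_with_lm = pick_best(hyps, lm_weight=0.3)
# With the LM switched off, the acoustically preferred hypothesis wins.
best_without_lm = pick_best(hyps, lm_weight=0.0)
```

In this toy setting the LM weight is the only tunable quantity; in practice it would be chosen on a development set.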


Meta-Learning With Latent Space Clustering in Generative Adversarial Network for Speaker Diarization

Pal, Kumar, Peri et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.


Speaker Diarization Using Latent Space Clustering in Generative Adversarial Network

Pal, Kumar, Peri et al. 2020


In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with a GAN loss, a latent variable recovery loss, and a clustering-specific loss. It takes x-vector speaker embeddings as input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. We benchmark our proposed system on the AMI meeting corpus and on two child-clinician interaction corpora (ADOS and BOSCC) from the autism diagnosis domain. ADOS and BOSCC contain diagnostic and treatment outcome sessions, respectively, obtained in clinical settings for verbal children and adolescents with autism. Experimental results show that our proposed system significantly outperforms the state-of-the-art x-vector based diarization system on these databases. Further, we perform embedding fusion with x-vectors to achieve relative DER improvements of 31%, 36% and 49% on the AMI eval, ADOS and BOSCC corpora, respectively, compared to the x-vector baseline using oracle speech segmentation.
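For readers unfamiliar with the term, a "relative DER improvement" is usually computed as the reduction in diarization error rate divided by the baseline error rate. The sketch below shows that arithmetic under this assumption; the function name and the DER values are hypothetical and are not taken from the paper, which reports only the relative figures.

```python
def relative_improvement(baseline_der: float, system_der: float) -> float:
    """Relative reduction in DER, in percent, with respect to the baseline."""
    if baseline_der <= 0:
        raise ValueError("baseline DER must be positive")
    return 100.0 * (baseline_der - system_der) / baseline_der


# Hypothetical example: a baseline DER of 10.0% reduced to 6.9%
# corresponds to a relative improvement of roughly 31%.
example = relative_improvement(10.0, 6.9)
```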


Multi-Task Multi-Resolution Char-to-BPE Cross-Attention Decoder for End-to-End Speech Recognition

Gowda¹, Garg², Kim³ et al. 2019



Manoj Kumar

Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap

End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System

