We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, a max-pooling loss trained LSTM with a randomly initialized network performs better than a cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, yielding a 67.6% relative reduction in Area Under the Curve (AUC) compared to the baseline feed-forward DNN.
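As an illustration of the training criterion described above, the sketch below (our own PyTorch rendering; the tensor layout, the non-keyword class index 0, and the per-utterance normalization are assumptions rather than details from the paper) back-propagates the keyword cross-entropy only through the frame with the highest keyword posterior, while background frames receive ordinary frame-level cross-entropy:

```python
import torch
import torch.nn.functional as F

def max_pooling_loss(logits, keyword_mask, keyword_id):
    """Illustrative max-pooling loss for keyword spotting.

    logits:       (T, num_classes) frame-level outputs of the LSTM
    keyword_mask: (T,) bool tensor marking frames inside the keyword segment
    keyword_id:   target class index of the keyword

    Assumption: class 0 is the non-keyword (background) class.
    """
    log_probs = F.log_softmax(logits, dim=-1)

    # Background frames: standard frame-wise cross-entropy on the background class.
    bg_loss = -log_probs[~keyword_mask, 0].sum()

    # Keyword segment: only the frame with the largest keyword posterior
    # contributes to the loss (the "max-pooling" step).
    kw_log_probs = log_probs[keyword_mask, keyword_id]
    kw_loss = -kw_log_probs.max() if kw_log_probs.numel() > 0 else logits.new_zeros(())

    # Normalize by utterance length (an illustrative choice).
    return (bg_loss + kw_loss) / logits.shape[0]
```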
For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus to improve automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method that preserves only the k highest values, preventing wrong emphasis of knowledge from the teacher and reducing the bandwidth needed for transferring data. We incorporate up to 8000 hours of untranscribed data for training and present our results on sequence trained models in addition to cross-entropy trained ones. The best sequence trained student model yields relative word error rate (WER) reductions of approximately 10.1%, 28.7% and 19.6% on our clean, simulated noisy and real test sets respectively, compared to a sequence trained teacher. Index Terms: automatic speech recognition, noise robustness, teacher-student training, domain adaptation. (Ladislav Mosner performed the work while he was a research intern at Amazon.)
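The logits selection step can be pictured with the following sketch (our own, in PyTorch; the choice of k, the renormalization of the kept posteriors, and the assumption that teacher outputs are precomputed and detached are ours, not specifics from the paper). The student, fed the noisy utterance, is trained toward the k largest teacher posteriors obtained from the parallel clean utterance:

```python
import torch
import torch.nn.functional as F

def topk_teacher_student_loss(student_logits, teacher_logits, k=20):
    """Illustrative T/S loss with top-k logit selection.

    student_logits: outputs of the student on the noisy utterance
    teacher_logits: precomputed (detached) outputs of the teacher on the
                    parallel clean utterance
    k:              number of teacher posteriors to keep per frame
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    topk_vals, topk_idx = teacher_probs.topk(k, dim=-1)
    # Renormalize the kept mass so it forms a valid distribution (an assumption).
    topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)

    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # Cross-entropy between the truncated teacher distribution and the student.
    return -(topk_vals * student_log_probs.gather(-1, topk_idx)).sum(-1).mean()
```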
Conversational agents are exploding in popularity. However, much work remains in the area of non-goal-oriented conversations, despite significant growth in research interest over recent years. To advance the state of the art in conversational AI, Amazon launched the Alexa Prize, a $2.5 million university competition in which sixteen selected university teams built conversational agents to deliver the best social conversational experience. The Alexa Prize provided the academic community with the unique opportunity to perform research with a live system used by millions of users. The subjectivity associated with evaluating conversations is a key element underlying the challenge of building non-goal-oriented dialogue systems. In this paper, we propose a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics which correlate well with human judgment. The proposed metrics provide granular analysis of the conversational agents, which is not captured in human ratings. We show that these metrics can be used as a reasonable proxy for human judgment. We provide a mechanism to unify the metrics for selecting the top performing agents, which has also been applied throughout the Alexa Prize competition. To our knowledge, this is the largest setting to date for evaluating agents, with millions of conversations and hundreds of thousands of ratings from users. We believe that this work is a step towards an automatic evaluation process for conversational AIs.
Neural language models (NLM) have been shown to outperform conventional n-gram language models by a substantial margin in Automatic Speech Recognition (ASR) and other tasks. There are, however, a number of challenges that need to be addressed for an NLM to be used in a practical large-scale ASR system. In this paper, we present solutions to some of these challenges, including training the NLM from heterogeneous corpora, limiting latency impact, and handling personalized bias in the second-pass rescorer. Overall, we show that we can achieve a 6.2% relative WER reduction using a neural LM in a second-pass n-best rescoring framework with a minimal increase in latency.
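The second-pass rescoring step itself can be sketched as follows (a minimal illustration of generic n-best rescoring; the function names, the linear interpolation with a single LM weight, and treating higher scores as better are assumptions, not details from the paper):

```python
from typing import Callable, List, Tuple

def rescore_nbest(
    nbest: List[Tuple[str, float]],          # (hypothesis, first-pass score), higher is better
    nlm_logprob: Callable[[str], float],     # neural LM log-probability of a hypothesis
    lm_weight: float = 0.5,                  # illustrative interpolation weight
) -> List[Tuple[str, float]]:
    """Illustrative second-pass n-best rescoring with a neural LM.

    Each hypothesis' first-pass score (acoustic + n-gram LM) is combined
    with the NLM log-probability, and the n-best list is re-ranked.
    """
    rescored = [
        (hyp, score + lm_weight * nlm_logprob(hyp))
        for hyp, score in nbest
    ]
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```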
We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance. It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface. This interface allows for end-to-end training using multi-task RNNT and NLU losses. Additionally, we introduce semantic sequence loss training for the joint RNNT-NLU system that allows direct optimization of non-differentiable SLU metrics. This end-to-end SLU model paradigm can leverage state-of-the-art advancements and pretrained models in both the ASR and NLU research communities, outperforming recently proposed direct speech-to-semantics models and conventional pipelined ASR and NLU systems. We show that this method improves both ASR and NLU metrics on public SLU datasets as well as on large proprietary datasets.
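The multi-task objective can be pictured with the following sketch (our own; the loss names and weights are illustrative hyperparameters, and the semantic sequence loss mentioned above is not shown). The RNNT loss from the ASR branch is combined with intent and slot losses from the NLU branch that consumes the neural interface outputs:

```python
def joint_slu_loss(rnnt_loss, intent_loss, slot_loss,
                   asr_weight=1.0, intent_weight=0.5, slot_weight=0.5):
    """Illustrative multi-task objective for a joint RNNT-NLU model.

    rnnt_loss:   transducer loss from the streaming ASR branch
    intent_loss: cross-entropy loss of the NLU intent classifier
    slot_loss:   cross-entropy loss of the NLU slot tagger
    The weights are hyperparameters chosen here purely for illustration.
    """
    return (asr_weight * rnnt_loss
            + intent_weight * intent_loss
            + slot_weight * slot_loss)
```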