2017
DOI: 10.1109/taslp.2017.2672401

Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition

Cited by 199 publications (83 citation statements)
References 28 publications

“…Training of neural networks that operate on the raw signals and are optimized for the discriminative cost function of the acoustic model has also been explored recently. These approaches are termed Neural Beamforming approaches, as the neural network acoustic model subsumes the functionality of the beamformer [20,21].…”
Section: Related Prior Work (mentioning)
confidence: 99%
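
To make the neural beamforming idea in the statement above concrete, the following minimal sketch (PyTorch; the class name, filter counts, and kernel sizes are illustrative assumptions, not values from the paper) learns a single multichannel convolution over raw waveforms. Trained jointly with the acoustic model on the ASR loss, such a layer plays the role of a data-driven filter-and-sum beamformer.

```python
# Minimal sketch (assumed shapes, not the paper's exact configuration):
# a single multichannel Conv1d over raw waveforms acts as a learned
# filter-and-sum front-end; joint training lets the ASR loss shape it.
import torch
import torch.nn as nn

class NeuralBeamformerFrontEnd(nn.Module):
    def __init__(self, num_channels=2, num_filters=128, kernel_size=400, hop=160):
        super().__init__()
        # Each output filter spans all microphone channels, so it combines
        # spatial (across-channel) and spectral (within-channel) filtering.
        self.spatio_temporal = nn.Conv1d(
            in_channels=num_channels,
            out_channels=num_filters,
            kernel_size=kernel_size,  # ~25 ms at 16 kHz
            stride=hop,               # ~10 ms frame shift
        )

    def forward(self, waveforms):
        # waveforms: (batch, channels, samples) of raw multichannel audio
        activations = self.spatio_temporal(waveforms)
        # Rectify and log-compress, a common stand-in for a filterbank output.
        return torch.log1p(torch.relu(activations))

# The resulting features would feed the acoustic model, and both parts would
# be optimized with the same discriminative ASR objective.
frontend = NeuralBeamformerFrontEnd()
features = frontend(torch.randn(4, 2, 16000))  # -> (4, 128, frames)
```
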
“…Prior work has shown that learning a multi-channel front-end jointly with the AM using the ASR objective can improve far-field performance. In [8], Sainath et al. showed that input from a data-driven multi-channel front-end provides better results than both single-channel and beamformed input. They introduced a set of convolutional filters applied directly to the raw audio [8].…”
Section: Introduction (mentioning)
confidence: 99%
“…In [8], Sainath et al. showed that input from a data-driven multi-channel front-end provides better results than both single-channel and beamformed input. They introduced a set of convolutional filters applied directly to the raw audio [8]. The convolutional and linear structures are both designed to explicitly incorporate multiple beamformer "look directions", subsuming a multi-geometry beamforming component into the deep neural network (DNN).…”
Section: Introduction (mentioning)
confidence: 99%
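
The "look direction" structure mentioned in these statements can be pictured as a factored front-end: a first spatial layer with one short multichannel filter per hypothesized direction, followed by longer spectral filters shared across directions and pooled over directions. The sketch below is only an approximation under assumed dimensions (10 look directions, 128 spectral filters), not the exact configuration of [8].

```python
# Hedged sketch of a factored "look direction" front-end: P short multichannel
# spatial filters (one per hypothesized direction), then longer spectral
# filters shared across directions, max-pooled over directions.
# All dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class LookDirectionFrontEnd(nn.Module):
    def __init__(self, num_channels=2, num_directions=10,
                 spatial_kernel=81, num_spectral=128,
                 spectral_kernel=400, hop=160):
        super().__init__()
        # Spatial layer: each output channel is one learned filter-and-sum
        # beamformer, i.e. one "look direction".
        self.spatial = nn.Conv1d(num_channels, num_directions,
                                 kernel_size=spatial_kernel,
                                 padding=spatial_kernel // 2)
        # Spectral layer: single-channel filters shared across all directions.
        self.spectral = nn.Conv1d(1, num_spectral,
                                  kernel_size=spectral_kernel, stride=hop)
        self.num_spectral = num_spectral

    def forward(self, waveforms):
        # waveforms: (batch, channels, samples)
        beams = self.spatial(waveforms)             # (batch, P, samples)
        b, p, n = beams.shape
        beams = beams.reshape(b * p, 1, n)          # treat each beam separately
        feats = torch.relu(self.spectral(beams))    # (batch*P, F, frames)
        feats = feats.reshape(b, p, self.num_spectral, -1)
        # Pool over look directions so the DNN acoustic model sees one map.
        return torch.log1p(feats.max(dim=1).values) # (batch, F, frames)

frontend = LookDirectionFrontEnd()
features = frontend(torch.randn(4, 2, 16000))       # -> (4, 128, frames)
```
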
“…In recent years, deep learning techniques have significantly improved speech recognition accuracy [4,5,6,7,8]. This improvement has come from the shift from Gaussian Mixture Models (GMMs) to Feed-Forward Deep Neural Networks (FF-DNNs), from FF-DNNs to Recurrent Neural Networks (RNNs), and in particular to Long Short-Term Memory (LSTM) networks [9].…”
Section: Introduction (mentioning)
confidence: 99%
“…LibriSpeech LM corpus. The best performance was achieved when the window length was 50 ms and the warping coefficients were uniformly distributed between 0.8 and 1.2.…”
mentioning
confidence: 99%
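
As a small illustration of the quoted hyperparameter choice, the sketch below draws a warping coefficient uniformly from [0.8, 1.2] and applies it as a simple linear resampling of the waveform. The actual warping function used by the cited work is not given in the excerpt, so this particular form is an assumption.

```python
# Illustrative sketch only: draw a warping coefficient uniformly from
# [0.8, 1.2] and apply it as a simple linear resampling of the waveform.
# The exact warping function of the cited work is not specified in the excerpt.
import numpy as np

def sample_warp_coefficient(rng, low=0.8, high=1.2):
    """Draw one warping coefficient uniformly from [low, high]."""
    return rng.uniform(low, high)

def warp_waveform(waveform, alpha):
    """Resample by factor alpha; alpha > 1 shortens (speeds up) the signal."""
    n = len(waveform)
    positions = np.arange(n) * alpha          # where each output sample reads from
    positions = positions[positions < n - 1]  # stay inside the input
    return np.interp(positions, np.arange(n), waveform)

rng = np.random.default_rng(0)
alpha = sample_warp_coefficient(rng)
warped = warp_waveform(np.random.randn(16000), alpha)
```
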