2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953077

Very deep convolutional networks for end-to-end speech recognition

Abstract: Sequence-to-sequence models have shown success in end-to-end speech recognition. However, these models have only used shallow acoustic encoder networks. In our work, we successively train very deep convolutional networks to add more expressive power and better generalization to end-to-end ASR models. We apply network-in-network principles, batch normalization, residual connections, and convolutional LSTMs to build very deep recurrent and convolutional structures. Our models exploit the spectral structure in the…
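To make the abstract's ingredients concrete, here is a minimal PyTorch sketch of one residual 2-D convolutional block that combines batch normalization with a network-in-network style 1×1 convolution. The class name, channel count, and layer ordering are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: one residual 2-D conv block with batch normalization and a
# network-in-network style 1x1 convolution. Sizes are illustrative only.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):  # hypothetical name, for illustration
    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        # 1x1 "network-in-network" convolution mixes channels per position.
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, frequency)
        residual = x
        out = self.relu(self.bn1(self.conv3x3(x)))
        out = self.bn2(self.conv1x1(out))
        return self.relu(out + residual)  # residual (skip) connection

# Example: 8 utterances, 32 channels, 100 time frames, 40 mel bins.
x = torch.randn(8, 32, 100, 40)
print(ResidualConvBlock(32)(x).shape)  # torch.Size([8, 32, 100, 40])
```

Because the skip connection leaves the input shape unchanged, blocks like this can be stacked to add depth without the optimization difficulties the abstract alludes to.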

Cited by 362 publications (237 citation statements) · References 19 publications
“…Considering the limitations imposed on our model by stopping at a fixed evaluation epoch, it would be possible to further boost performance by utilizing early stopping with a validation set. And while the input features were selected from empirical observations made in previous studies, the results could be improved by extracting the features in an unsupervised manner using autoencoders (Poultney et al., 2007; Le et al., 2011) or by training the decoder end-to-end using convolutional LSTMs (Shi et al., 2015; Zhang et al., 2016).…”
Section: Discussion
confidence: 99%
“…End-to-end models have become a popular choice for speech recognition, thanks both to the simplicity of building them and to their superior performance over conventional systems [3,4,5,6,7,8,9,10,11,12,1,2]. In contrast to conventional systems, which consist of separate acoustic, pronunciation, and language modeling components, end-to-end approaches formulate the speech recognition problem directly as a mapping from utterances to transcripts, which greatly simplifies the training and decoding processes.…”
Section: Introduction
confidence: 99%
“…The default encoder we used is a 4-layer stacked 2-dimensional convolution (with batch normalization between layers), with kernel size (3, 3) on both the time-frame axis and the feature axis [32,11]. 2× downsampling is employed at layers 1 and 3, resulting in 1/4 of the time frames after convolution.…”
Section: CNN-LSTM Encoder
confidence: 99%
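The encoder this excerpt describes is concrete enough to sketch: four stacked (3, 3) convolutions with batch normalization between layers and 2× downsampling at layers 1 and 3, which cuts the time axis to 1/4 of its input length. The channel widths and the helper name below are assumptions for illustration, not taken from the citing paper.

```python
# Minimal sketch of the 4-layer 2-D convolutional front end described above:
# kernel (3, 3) throughout, batch normalization between layers, and stride-2
# downsampling at layers 1 and 3 so the time axis shrinks by a factor of 4.
import torch
import torch.nn as nn

def conv_frontend(in_ch: int = 1, ch: int = 32) -> nn.Sequential:
    layers = []
    for idx in range(4):
        stride = 2 if idx in (0, 2) else 1  # 2x downsampling at layers 1 and 3
        layers += [
            nn.Conv2d(in_ch if idx == 0 else ch, ch,
                      kernel_size=(3, 3), stride=stride, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(),
        ]
    return nn.Sequential(*layers)

enc = conv_frontend()
x = torch.randn(4, 1, 100, 80)  # (batch, 1, time frames, features)
print(enc(x).shape)  # torch.Size([4, 32, 25, 20]) -> 1/4 of 100 time frames
```

The two stride-2 layers each halve the time axis (100 → 50 → 25), which is what yields the 1/4 frame rate the excerpt reports before the LSTM layers take over.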