2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953077

Very deep convolutional networks for end-to-end speech recognition

Abstract: Sequence-to-sequence models have shown success in end-to-end speech recognition. However, these models have only used shallow acoustic encoder networks. In our work, we successively train very deep convolutional networks to add more expressive power and better generalization to end-to-end ASR models. We apply network-in-network principles, batch normalization, residual connections, and convolutional LSTMs to build very deep recurrent and convolutional structures. Our models exploit the spectral structure in the…
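To make the abstract's ingredients concrete, here is a minimal PyTorch sketch of one residual 2-D convolutional block that combines batch normalization with a network-in-network style 1×1 convolution. The class name, channel count, and layer ordering are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: one residual 2-D conv block with batch normalization and a
# network-in-network style 1x1 convolution. Sizes are illustrative only.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):  # hypothetical name, for illustration
    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        # 1x1 "network-in-network" convolution mixes channels per position.
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, frequency)
        residual = x
        out = self.relu(self.bn1(self.conv3x3(x)))
        out = self.bn2(self.conv1x1(out))
        return self.relu(out + residual)  # residual (skip) connection

# Example: 8 utterances, 32 channels, 100 time frames, 40 mel bins.
x = torch.randn(8, 32, 100, 40)
print(ResidualConvBlock(32)(x).shape)  # torch.Size([8, 32, 100, 40])
```

Because the skip connection leaves the input shape unchanged, blocks like this can be stacked to add depth without the optimization difficulties the abstract alludes to.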

Cited by 362 publications (237 citation statements) · References 19 publications
“…Considering the limitations imposed on our model by stopping at a fixed evaluation epoch, it would be possible to further boost performance by utilizing early stopping with a validation set. And while the input features were selected from empirical observations made in previous studies, the results could be improved by extracting the features in an unsupervised manner using autoencoders (Poultney et al., 2007; Le et al., 2011) or by training the decoder end-to-end using convolutional LSTMs (Shi et al., 2015; Zhang et al., 2016).…”
Section: Discussion
confidence: 99%
“…End-to-end models have become a popular choice for speech recognition, thanks both to the simplicity of building them and to their superior performance over conventional systems [3,4,5,6,7,8,9,10,11,12,1,2]. In contrast to conventional systems, which consist of separate acoustic, pronunciation, and language modeling components, end-to-end approaches formulate the speech recognition problem directly as a mapping from utterances to transcripts, which greatly simplifies the training and decoding processes.…”
Section: Introduction
confidence: 99%
“…The default encoder we used is a 4-layer stacked 2-dimensional convolution (with batch normalization between layers), with kernel size (3, 3) on both the time-frame axis and the feature axis [32,11]. 2× downsampling is employed at layers 1 and 3, resulting in 1/4 of the time frames after convolution.…”
Section: CNN-LSTM Encoder
confidence: 99%
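The encoder this excerpt describes is concrete enough to sketch: four stacked (3, 3) convolutions with batch normalization between layers and 2× downsampling at layers 1 and 3, which cuts the time axis to 1/4 of its input length. The channel widths and the helper name below are assumptions for illustration, not taken from the citing paper.

```python
# Minimal sketch of the 4-layer 2-D convolutional front end described above:
# kernel (3, 3) throughout, batch normalization between layers, and stride-2
# downsampling at layers 1 and 3 so the time axis shrinks by a factor of 4.
import torch
import torch.nn as nn

def conv_frontend(in_ch: int = 1, ch: int = 32) -> nn.Sequential:
    layers = []
    for idx in range(4):
        stride = 2 if idx in (0, 2) else 1  # 2x downsampling at layers 1 and 3
        layers += [
            nn.Conv2d(in_ch if idx == 0 else ch, ch,
                      kernel_size=(3, 3), stride=stride, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(),
        ]
    return nn.Sequential(*layers)

enc = conv_frontend()
x = torch.randn(4, 1, 100, 80)  # (batch, 1, time frames, features)
print(enc(x).shape)  # torch.Size([4, 32, 25, 20]) -> 1/4 of 100 time frames
```

The two stride-2 layers each halve the time axis (100 → 50 → 25), which is what yields the 1/4 frame rate the excerpt reports before the LSTM layers take over.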