Interspeech 2016
DOI: 10.21437/interspeech.2016-1446
Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

Abstract: Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it…
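As a concrete illustration of the CNN-plus-CTC approach the abstract outlines, here is a minimal PyTorch sketch. It is not the authors' architecture: the layer sizes, the 40-dimensional filterbank input, and the 62-label inventory (61 TIMIT phones plus a CTC blank) are assumptions for illustration only.

```python
# Minimal sketch (assumed dimensions, not the paper's architecture): a purely
# convolutional acoustic model trained with the CTC loss, i.e. no recurrent
# connections between the feature extractor and the output layer.
import torch
import torch.nn as nn

class ConvCTC(nn.Module):
    def __init__(self, n_mels=40, n_labels=62):  # 61 phones + blank (assumed)
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool along frequency only; keep time resolution
        )
        self.classifier = nn.Linear(32 * (n_mels // 2), n_labels)

    def forward(self, x):               # x: (batch, time, n_mels)
        x = self.conv(x.unsqueeze(1))   # -> (batch, 32, time, n_mels // 2)
        x = x.permute(0, 2, 1, 3).flatten(2)
        return self.classifier(x).log_softmax(-1)

model = ConvCTC()
feats = torch.randn(4, 100, 40)              # 4 utterances, 100 frames each
log_probs = model(feats).transpose(0, 1)     # CTCLoss wants (time, batch, labels)
targets = torch.randint(1, 62, (4, 20))      # dummy phone targets (blank = 0)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((4,), 100),
                           target_lengths=torch.full((4,), 20))
```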

Cited by: 239 publications (168 citation statements)
References: 28 publications
“…LSTM networks have been used for modeling time series data in many fields [28]. Two common approaches are (1) encoding/decoding [40], where information is forced to be recovered at every time step, and (2) sequence-to-sequence [41], where the network takes in the entire input first and then decodes it into a different time series. The former focuses on learning the inter-frame dependencies, while the latter targets the mappings between sequences.…”
Section: Spatio-Temporal Recurrent Neural Network (STRNN)
Citation type: mentioning
Confidence: 99%
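To make the two patterns in this excerpt concrete, here is an illustrative PyTorch sketch. All dimensions and the zero-filled decoder inputs are assumptions, not details taken from [40] or [41].

```python
# Sketch of the two LSTM usage patterns described above (assumed shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(8, 50, 16)  # (batch, time, features)

# (1) Encoding/decoding: the model must recover the input at every time
# step, which forces it to learn inter-frame dependencies.
enc = nn.LSTM(16, 32, batch_first=True)
proj = nn.Linear(32, 16)
states, _ = enc(x)                        # one hidden state per frame
recon_loss = F.mse_loss(proj(states), x)  # per-step reconstruction target

# (2) Sequence-to-sequence: consume the whole input first, then decode a
# different-length output sequence from the final encoder state.
encoder = nn.LSTM(16, 32, batch_first=True)
decoder = nn.LSTM(16, 32, batch_first=True)
_, (h, c) = encoder(x)                    # summary of the entire input
dec_in = torch.zeros(8, 20, 16)           # 20 decoder steps (assumed inputs)
y, _ = decoder(dec_in, (h, c))            # output length decoupled from input
```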
“…the phoneme label before training. CNN-followed-by-RNN architectures have shown a strong ability to handle sequence-related problems such as scene text recognition [22] and ASR [23]. These make the ASR network easy to train and able to perform better with fewer parameters.…”
Section: Layer
Citation type: mentioning
Confidence: 99%
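A minimal sketch of the CNN-followed-by-RNN pattern this statement refers to: convolutions extract local features, and a recurrent layer models the sequence on top of them. The dimensions and the GRU choice are illustrative assumptions, not taken from [22] or [23].

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_feats=40, n_classes=62):
        super().__init__()
        # 1-D convolution over time, with the feature dimension as channels
        self.cnn = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(64, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):  # x: (batch, time, n_feats)
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.rnn(x)
        return self.out(x)  # per-frame class scores

scores = CRNN()(torch.randn(2, 100, 40))  # -> shape (2, 100, 62)
```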
“…The comparable performance with temporal-representation-based methods suggests that the DAC could be a potential substitute for RNNs in some specific areas. In fact, the RNN itself is computationally expensive and sometimes difficult to train [67]. We directly model the dependency at the feature level, which is faster than the temporal representation of original images [22] and more effective than the adversarial face-generation-based method [33].…”
Section: Results on YouTube Face Dataset
Citation type: mentioning
Confidence: 99%