Interspeech 2016
DOI: 10.21437/interspeech.2016-1446
Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

Abstract: Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it…
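As a concrete illustration of the CNN-plus-CTC approach the abstract outlines, here is a minimal PyTorch sketch. It is not the authors' architecture: the layer sizes, the 40-dimensional filterbank input, and the 62-label inventory (61 TIMIT phones plus a CTC blank) are assumptions for illustration only.

```python
# Minimal sketch (assumed dimensions, not the paper's architecture): a purely
# convolutional acoustic model trained with the CTC loss, i.e. no recurrent
# connections between the feature extractor and the output layer.
import torch
import torch.nn as nn

class ConvCTC(nn.Module):
    def __init__(self, n_mels=40, n_labels=62):  # 61 phones + blank (assumed)
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool along frequency only; keep time resolution
        )
        self.classifier = nn.Linear(32 * (n_mels // 2), n_labels)

    def forward(self, x):               # x: (batch, time, n_mels)
        x = self.conv(x.unsqueeze(1))   # -> (batch, 32, time, n_mels // 2)
        x = x.permute(0, 2, 1, 3).flatten(2)
        return self.classifier(x).log_softmax(-1)

model = ConvCTC()
feats = torch.randn(4, 100, 40)              # 4 utterances, 100 frames each
log_probs = model(feats).transpose(0, 1)     # CTCLoss wants (time, batch, labels)
targets = torch.randint(1, 62, (4, 20))      # dummy phone targets (blank = 0)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((4,), 100),
                           target_lengths=torch.full((4,), 20))
```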

Cited by: 239 publications (168 citation statements)
References: 28 publications
“…LSTM networks have been used for modeling time series data in many fields [28]. Two common approaches are (1) encoding/decoding [40], where information is forced to be recovered at every time step, and (2) sequence-to-sequence [41], where the network takes in the entire input first and then decodes it into a different time series. The former focuses on learning the inter-frame dependencies, while the latter targets the mappings between sequences.…”
Section: Spatio-Temporal Recurrent Neural Network (STRNN)
Citation type: mentioning
Confidence: 99%
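To make the two patterns in this excerpt concrete, here is an illustrative PyTorch sketch. All dimensions and the zero-filled decoder inputs are assumptions, not details taken from [40] or [41].

```python
# Sketch of the two LSTM usage patterns described above (assumed shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(8, 50, 16)  # (batch, time, features)

# (1) Encoding/decoding: the model must recover the input at every time
# step, which forces it to learn inter-frame dependencies.
enc = nn.LSTM(16, 32, batch_first=True)
proj = nn.Linear(32, 16)
states, _ = enc(x)                        # one hidden state per frame
recon_loss = F.mse_loss(proj(states), x)  # per-step reconstruction target

# (2) Sequence-to-sequence: consume the whole input first, then decode a
# different-length output sequence from the final encoder state.
encoder = nn.LSTM(16, 32, batch_first=True)
decoder = nn.LSTM(16, 32, batch_first=True)
_, (h, c) = encoder(x)                    # summary of the entire input
dec_in = torch.zeros(8, 20, 16)           # 20 decoder steps (assumed inputs)
y, _ = decoder(dec_in, (h, c))            # output length decoupled from input
```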
“…the phoneme label before training. CNN-followed-by-RNN architectures have shown a strong ability to handle sequence-related problems such as scene text recognition [22] and ASR [23]. These make the ASR network easy to train and able to perform better with fewer parameters.…”
Section: Layer
Citation type: mentioning
Confidence: 99%
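A minimal sketch of the CNN-followed-by-RNN pattern this statement refers to: convolutions extract local features, and a recurrent layer models the sequence on top of them. The dimensions and the GRU choice are illustrative assumptions, not taken from [22] or [23].

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_feats=40, n_classes=62):
        super().__init__()
        # 1-D convolution over time, with the feature dimension as channels
        self.cnn = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(64, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):  # x: (batch, time, n_feats)
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.rnn(x)
        return self.out(x)  # per-frame class scores

scores = CRNN()(torch.randn(2, 100, 40))  # -> shape (2, 100, 62)
```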
“…The comparable performance with temporal-representation-based methods suggests that the DAC could be a potential substitute for RNNs in some specific areas. In fact, the RNN itself is computationally expensive and sometimes difficult to train [67]. We directly model the dependency at the feature level, which is faster than the temporal representation of original images [22] and more effective than the adversarial face-generation-based method [33].…”
Section: Results on YouTube Face Dataset
Citation type: mentioning
Confidence: 99%