2013 IEEE International Conference on Acoustics, Speech and Signal Processing
DOI: 10.1109/icassp.2013.6639345

Recent advances in deep learning for speech research at Microsoft

Cited by 648 publications (338 citation statements)
References 44 publications
“…Such systems are widely used by ATMs for digit recognition on checks. However, the early 2010s have seen a blossoming of DNN-based applications with highlights such as Microsoft's speech recognition system in 2011 [2] and the AlexNet system for image recognition in 2012 [3]. A brief chronology of deep learning is shown in Fig.…”
Section: Development History
Citation type: mentioning
confidence: 99%
“…4 [10]. 2 Thus, techniques for efficiently performing 2 To backpropagate through each filter: (1) compute the gradient of the loss relative to the weights from the filter inputs (i.e., the forward activations) and the gradients of the loss relative to the filter outputs; (2) compute the gradient of the loss relative to the filter inputs from the filter weights and the gradients of the loss relative to the filter outputs.…”
Section: Inference Versus Training
Citation type: mentioning
confidence: 99%
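
The two backpropagation steps described in the statement above can be illustrated with a small sketch. This is not code from the citing paper: it assumes a single 1-D filter applied as a valid cross-correlation, and the names `filter_forward` and `filter_backward` are hypothetical, chosen only for illustration.

```python
def filter_forward(x, w):
    """Valid 1-D cross-correlation: y[i] = sum_k w[k] * x[i + k]."""
    n_out = len(x) - len(w) + 1
    return [sum(w[k] * x[i + k] for k in range(len(w))) for i in range(n_out)]


def filter_backward(x, w, grad_y):
    """Return (grad_w, grad_x) given the loss gradient w.r.t. the filter outputs."""
    # Step (1): the weight gradient combines the filter inputs (the forward
    # activations x) with the gradients of the loss w.r.t. the outputs.
    grad_w = [sum(grad_y[i] * x[i + k] for i in range(len(grad_y)))
              for k in range(len(w))]
    # Step (2): the input gradient combines the filter weights w with the
    # gradients of the loss w.r.t. the outputs.
    grad_x = [0.0] * len(x)
    for i, g in enumerate(grad_y):
        for k, wk in enumerate(w):
            grad_x[i + k] += g * wk
    return grad_w, grad_x


# Illustrative values only; the loss is assumed to be the sum of the outputs,
# so its gradient w.r.t. every output is 1.0.
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -1.0]
y = filter_forward(x, w)
grad_w, grad_x = filter_backward(x, w, [1.0] * len(y))
```

Training needs both steps for every filter, whereas inference needs only the forward pass.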
“…Following convention, each frame was multiplied by a Hamming window. Although we have experimented with many audio features, for this report we use the log of the filter bank values as described by Deng et al in [11]. Under this scenario we have a feature vector of 40 audio samples temporally aligned with the 3 Euler angles: nod (x), yaw (y) and roll (z).…”
Section: Feature Extraction
Citation type: mentioning
confidence: 99%
“…Recently, the Graphics Processor Unit (GPU) has enabled efficient training of Deep Neural Networks (DNNs), and within many aspects of speech and language processing, DNNs are now state of the art [10,11,12]. DNNs were proposed as a modelling strategy for head motion prediction by Ding et al [13].…”
Section: Introduction
Citation type: mentioning
confidence: 99%