Recent advances in deep learning for speech research at Microsoft

Deng, Li; Li, Jinyu; Huang, Jui-Ting; Yao, Kaisheng; Yu, Dong; Seide, Frank; Seltzer, Michael L.; Zweig, Geoff; He, Xiaodong; Williams, Jason D.; Gong, Yifan; Acero, Alex

doi:10.1109/icassp.2013.6639345

Cited by 648 publications (338 citation statements, 328 classified as mentioning)

References 44 publications

“…Such systems are widely used by ATMs for digit recognition on checks. However, the early 2010s have seen a blossoming of DNN-based applications with highlights such as Microsoft's speech recognition system in 2011 [2] and the AlexNet system for image recognition in 2012 [3]. A brief chronology of deep learning is shown in Fig.…”

Section: Development History (mentioning)

confidence: 99%

“…4 [10]. Thus, techniques for efficiently performing [footnote 2: To backpropagate through each filter: (1) compute the gradient of the loss relative to the weights from the filter inputs (i.e., the forward activations) and the gradients of the loss relative to the filter outputs; (2) compute the gradient of the loss relative to the filter inputs from the filter weights and the gradients of the loss relative to the filter outputs.]…”

Section: Inference Versus Training (mentioning)

confidence: 99%
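The two backpropagation steps named in the statement above can be sketched for a single 1-D filter. This is a minimal illustration, not code from the citing paper: it assumes a "valid" cross-correlation as the forward pass, and the names `filter_forward`, `filter_backward`, `x`, `w`, and `grad_y` are hypothetical.

```python
import numpy as np

def filter_forward(x, w):
    # Assumed forward pass: y[i] = sum_k x[i + k] * w[k] ("valid" cross-correlation).
    return np.correlate(x, w, mode="valid")

def filter_backward(x, w, grad_y):
    # Step (1): gradient w.r.t. the weights, from the filter inputs (forward
    # activations) and the gradient w.r.t. the filter outputs.
    grad_w = np.correlate(x, grad_y, mode="valid")
    # Step (2): gradient w.r.t. the inputs, from the filter weights and the
    # gradient w.r.t. the filter outputs.
    grad_x = np.convolve(grad_y, w, mode="full")
    return grad_w, grad_x

# Illustrative sizes only: 8 inputs and 3 weights give 6 outputs.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal(3)
grad_y = rng.standard_normal(6)  # stand-in for the upstream gradient dL/dy
grad_w, grad_x = filter_backward(x, w, grad_y)
```

Both gradients have the same shapes as the quantities they differentiate (`grad_w` matches `w`, `grad_x` matches `x`), which is a quick sanity check when adapting the sketch to other filter layouts.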

“…Since the breakthrough application of DNNs to speech recognition [2] and image recognition [3], the number of applications that use DNNs has exploded. These DNNs are employed in a myriad of applications from self-driving cars [4], to detecting cancer [5] to playing complex games [6].…”

Section: Introduction (mentioning)

confidence: 99%


Efficient Processing of Deep Neural Networks: A Tutorial and Survey

et al. 2017


Abstract: Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.


“…Following convention, each frame was multiplied by a Hamming window. Although we have experimented with many audio features, for this report we use the log of the filter bank values as described by Deng et al in [11]. Under this scenario we have a feature vector of 40 audio samples temporally aligned with the 3 Euler angles: nod (x), yaw (y) and roll (z).…”

Section: Feature Extraction (mentioning)

confidence: 99%
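The feature extraction described in the statement above (a Hamming-windowed frame followed by the log of 40 filter bank values) can be sketched as follows. This is an illustration under stated assumptions, not the citing authors' pipeline: the mel spacing of the triangular filters, the 16 kHz sample rate, the 400-sample frame, the 512-point FFT, and all function names are hypothetical choices; only the Hamming window, the log, and the 40 values come from the quoted text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters with centres equally spaced on the mel scale (assumed).
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_points[i:i + 3]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_filterbank_features(frame, bank, n_fft):
    # Multiply the frame by a Hamming window, as in the quoted statement.
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    # A small floor avoids log(0) for empty bands.
    return np.log(bank @ power + 1e-10)

sample_rate = 16000  # assumed
n_fft = 512          # assumed
frame = np.sin(2 * np.pi * 440.0 * np.arange(400) / sample_rate)  # synthetic 25 ms frame
bank = mel_filterbank(40, n_fft, sample_rate)
features = log_filterbank_features(frame, bank, n_fft)  # 40 log filter bank values
```

In the cited setup each such 40-value vector would then be aligned in time with the three Euler angles; that alignment step is outside this sketch.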

“…Recently, the Graphics Processor Unit (GPU) has enabled efficient training of Deep Neural Networks (DNNs), and within many aspects of speech and language processing, DNNs are now state of the art [10,11,12]. DNNs were proposed as a modelling strategy for head motion prediction by Ding et al [13].…”

Section: Introduction (mentioning)

confidence: 99%

Predicting Head Pose from Speech with a Conditional Variational Autoencoder

Greenwood, Laycock, Matthews

2017

Interspeech 2017


Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution visual cues make to the degree to which we, as human observers, find an animation acceptable. Rigid head motion is one visual mode that universally co-occurs with speech, and so it is a reasonable strategy to seek a transformation from the speech mode to predict the head pose. Several previous authors have shown that prediction is possible, but experiments are typically confined to rigidly produced dialogue. Natural, expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose. Recently, Long Short Term Memory (LSTM) networks have become an important tool for modelling speech and natural language tasks. We employ Deep Bi-Directional LSTMs (BLSTM), capable of learning long-term structure in language, to model the relationship that speech has with rigid head motion. We then extend our model by conditioning with prior motion. Finally, we introduce a generative head motion model, conditioned on audio features using a Conditional Variational Autoencoder (CVAE). Each approach mitigates the problems of the one-to-many mapping that a speech-to-head-pose model must accommodate.


Microphone‐Array‐Based Speech Enhancement Using Neural Networks

Pertilä

2017

Parametric Time‐Frequency Domain Spatial Audio
