The Microsoft 2017 Conversational Speech Recognition System

Xiong, Wei; Wu, Lingfeng; Alleva, Fil; Droppo, Jasha; Huang, Xuedong; Stolcke, Andreas

doi:10.1109/icassp.2018.8461870

Cited by 379 publications

(246 citation statements)

References 49 publications

Mentioning: 227

“…Automatic Speech Recognition (ASR) is a key technology for the task of automatic analysis of any kind of spoken speech, e.g., phone calls or meetings. For scenarios of relatively clean speech, e.g., recordings of telephone speech or audio books, ASR technologies have improved drastically over the recent years [1]. More realistic scenarios like spontaneous speech or meetings with multiple participants in many cases require the ASR system to recognize the speech of multiple speakers simultaneously.…”

Section: Introduction (mentioning)

confidence: 99%

End-to-End Training of Time Domain Audio Separation and Recognition

Neumann, Kinoshita, Drude et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.

“…The later state-of-the-art model, DenseNet [10], also uses SC and BN. Besides success in computer vision, ResNet has also performed well in acoustic models for speech recognition [11,12].…”

Section: Introduction (mentioning)

confidence: 99%

SNDCNN: Self-Normalizing Deep CNNs with Scaled Exponential Linear Units for Speech Recognition

Huang, Liu et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Very deep CNNs achieve state-of-the-art results in both computer vision and speech recognition, but are difficult to train. The most popular way to train very deep CNNs is to use shortcut connections (SC) together with batch normalization (BN). Inspired by Self-Normalizing Neural Networks [1], we propose the self-normalizing deep CNN (SNDCNN) based acoustic model topology, by removing the SC/BN and replacing the typical ReLU activations with scaled exponential linear units (SELU) in ResNet-50. SELU activations make the network self-normalizing and remove the need for both shortcut connections and batch normalization. Compared to ResNet-50, we can achieve the same or lower word error rate (WER) while at the same time improving both training and inference speed by 60%-80%. We also explore other model inference optimizations to further reduce latency for production use.
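
For readers unfamiliar with the activation named in the abstract above, the following is a minimal sketch of SELU next to ReLU. It is not taken from the SNDCNN paper: the function names are illustrative, and the two constants are the standard values from the self-normalizing networks literature rather than anything stated on this page.

```python
import math

# Assumption: standard SELU constants from the self-normalizing
# networks paper; they are not given in the abstract above.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x: float) -> float:
    """Rectified linear unit: zero for negative inputs."""
    return max(0.0, x)


def selu(x: float) -> float:
    """Scaled exponential linear unit.

    Positive inputs are scaled linearly; negative inputs follow a
    scaled exponential that saturates at -SELU_SCALE * SELU_ALPHA.
    """
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * math.expm1(x)


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={value:+.1f}  relu={relu(value):.4f}  selu={selu(value):.4f}")
```

Unlike ReLU, SELU passes a bounded negative signal through, which is the property that lets activations keep a stable mean and variance across layers without batch normalization.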

“…Table 1 for descriptions. Improvements in speech recognition [28], dialogue generation [12,24], emotional speech synthesis [17,26] and computer graphics have made it possible to design more expressive and realistic conversational agents. However, there are still many uncertainties in how best to design embodied agents.…”

Section: Introduction (mentioning)

confidence: 99%

A High-Fidelity Open Embodied Avatar with Lip Syncing and Expression Capabilities

Aneja, McDuff, Shah 2019

2019 International Conference on Multimodal Interaction

Embodied avatars as virtual agents have many applications and provide benefits over disembodied agents, allowing nonverbal social and interactional cues to be leveraged, in a similar manner to how humans interact with each other. We present an open embodied avatar built upon the Unreal Engine that can be controlled via a simple Python programming interface. The avatar has lip syncing (phoneme control), head gesture and facial expression (using either facial action units or cardinal emotion categories) capabilities. We release code and models to illustrate how the avatar can be controlled like a puppet or used to create a simple conversational agent using public application programming interfaces (APIs).

GitHub link: https://github.com/danmcduff/AvatarSim

Figure 1: We present an open embodied avatar with lip syncing and expression capabilities that can be controlled via a simple Python interface. We provide examples of how to combine this with publicly available speech and dialogue APIs to construct a conversational embodied agent.
