Yajie Miao scite author profile

The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Despite this progress, building a new ASR system remains a challenging task, requiring various resources, multiple training stages and significant expertise. This paper presents our Eesen framework, which drastically simplifies the existing pipeline to build state-of-the-art ASR systems. Acoustic modeling in Eesen involves learning a single recurrent neural network (RNN) predicting context-independent targets (phonemes or characters). To remove the need for pre-generated frame labels, we adopt the connectionist temporal classification (CTC) objective function to infer the alignments between speech and label sequences. A distinctive feature of Eesen is a generalized decoding approach based on weighted finite-state transducers (WFSTs), which enables the efficient incorporation of lexicons and language models into CTC decoding. Experiments show that compared with the standard hybrid DNN systems, Eesen achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
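To make the CTC label convention concrete, here is a minimal sketch of the standard CTC collapse rule (merge consecutive repeats, then drop blanks) applied to a greedy per-frame best path. This is not Eesen's WFST-based decoder, which also brings in a lexicon and language model; the names `BLANK`, `ctc_collapse`, and `greedy_decode` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: greedy best-path CTC decoding, not the WFST
# decoding described in the abstract. All names here are hypothetical.

BLANK = "<blank>"  # assumed name for the CTC blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop blanks."""
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


def greedy_decode(frame_posteriors, symbols, blank=BLANK):
    """Pick the most probable symbol per frame, then collapse."""
    best_path = []
    for posterior in frame_posteriors:
        best_index = max(range(len(posterior)), key=posterior.__getitem__)
        best_path.append(symbols[best_index])
    return ctc_collapse(best_path, blank)
```

For example, `ctc_collapse(["c", "c", BLANK, "a", "a", BLANK, "a", "t"])` returns `["c", "a", "a", "t"]`: the blank between the two runs of `"a"` is what allows a repeated character to survive the merge.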


Extracting deep bottleneck features using stacked auto-encoders

Gehring et al. 2013


In this work, a novel training scheme for generating bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise, unsupervised manner. Afterwards, the bottleneck layer and an additional layer are added and the whole network is fine-tuned to predict target phoneme states. We perform experiments on a Cantonese conversational telephone speech corpus and find that increasing the number of autoencoders in the network produces more useful features, but requires pre-training, especially when little training data is available. Using more unlabeled data for pre-training only yields additional gains. Evaluations on larger datasets and on different system setups demonstrate the general applicability of our approach. In terms of word error rate, relative improvements of 9.2% (Cantonese, ML training), 9.3% (Tagalog, BMMI-SAT training), 12% (Tagalog, confusion network combinations with MFCCs), and 8.7% (Switchboard) are achieved.


Speaker Adaptive Training of Deep Neural Network Acoustic Models Using I-Vectors

Miao, Zhang, Metze 2015

IEEE/ACM Trans. Audio Speech Lang. Process.

116


Deep maxout networks for low-resource speech recognition

Miao, Metze, Rawat 2013


As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show state-of-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks. Our focus is on the particular advantage of DMNs under low-resource conditions with limited transcribed speech. We extend DMNs to hybrid and bottleneck feature systems, and explore optimal network structures (number of maxout layers, pooling strategy, etc.) for both setups. On the newly released Babel corpus, behaviors of DMNs are extensively studied under different levels of data availability. Experiments show that DMNs improve low-resource speech recognition significantly. Moreover, DMNs introduce sparsity to their hidden activations and thus can act as sparse feature extractors.
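As a rough illustration of the maxout idea, the sketch below groups a layer's linear pre-activations and keeps only the maximum of each group. It assumes non-overlapping groups of equal size, which is one possible pooling strategy and not necessarily the one used in the paper; the function name `maxout` and the parameter `group_size` are hypothetical.

```python
# Illustrative sketch only: a maxout activation over non-overlapping groups.
# The grouping scheme and all names are assumptions, not the paper's setup.

def maxout(pre_activations, group_size):
    """Return the maximum of each consecutive group of pre-activations."""
    if group_size <= 0 or len(pre_activations) % group_size != 0:
        raise ValueError("length must be a positive multiple of group_size")
    return [
        max(pre_activations[start:start + group_size])
        for start in range(0, len(pre_activations), group_size)
    ]
```

With six pre-activations and `group_size=2`, `maxout([0.5, -1.0, 2.0, 0.1, -0.3, -0.2], 2)` returns `[0.5, 2.0, -0.2]`, so the layer outputs one value per group, not one per linear unit.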


Simplifying long short-term memory acoustic models for fast training and decoding

Miao, Wang et al. 2016


An empirical exploration of CTC acoustic models

Miao, Gowayyed et al. 2016


Improvements to speaker adaptive training of deep neural networks

Miao, Jiang, Zhang et al. 2014


Speaker adaptive training (SAT) is a well-studied technique for Gaussian mixture acoustic models (GMMs). Recently we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on word error rates (WERs). In this paper, we present different methods to further improve and extend SAT-DNN. First, we conduct detailed analysis to investigate i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from the video signal. On a collection of instructional videos, incorporation of the additional visual features is observed to boost the recognition accuracy of SAT-DNN.


Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach

Miao, Metze 2016


Automatic speech recognition (ASR) on video data naturally has access to two modalities: audio and video. In previous work, audiovisual ASR, which leverages visual features to help ASR, has been explored on restricted domains of videos. This paper aims to extend this idea to open-domain videos, for example videos uploaded to YouTube. We achieve this by adopting a unified deep learning approach. First, for the visual features, we propose to apply segment-(utterance-) level features, instead of highly restrictive frame-level features. These visual features are extracted using deep learning architectures which have been pre-trained on computer vision tasks, e.g., object recognition and scene labeling. Second, the visual features are incorporated into ASR under deep learning based acoustic modeling. In addition to simple feature concatenation, we also apply an adaptive training framework to incorporate visual features in a more flexible way. On a challenging video transcribing task, audiovisual ASR using our proposed approach gets notable improvements in terms of word error rates (WERs), compared to ASR merely using speech features.
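The "simple feature concatenation" mentioned above can be sketched as follows: a single segment-level visual vector is appended to every acoustic frame of the same utterance before the frames go to the acoustic model. This is a minimal illustration under assumed shapes, not the paper's implementation; the function name `concat_segment_feature` and the toy dimensions are hypothetical.

```python
# Illustrative sketch only: append one segment-level (utterance-level) visual
# feature vector to each acoustic frame. Names and sizes are assumptions.

def concat_segment_feature(acoustic_frames, visual_vector):
    """Return new frames, each extended with the same visual vector."""
    return [list(frame) + list(visual_vector) for frame in acoustic_frames]


# Toy utterance: 3 frames of 2-dim acoustic features, 2-dim visual vector.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
visual = [9.0, 8.0]
augmented = concat_segment_feature(frames, visual)
```

Each augmented frame then has the acoustic dimensions followed by the visual ones, so the acoustic model's input layer would need to grow by the length of the visual vector.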


Yajie Miao

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
