2013 IEEE International Conference on Acoustics, Speech and Signal Processing
DOI: 10.1109/icassp.2013.6639014

Multi-level adaptive networks in tandem and hybrid ASR systems

Abstract: In this paper we investigate the use of multi-level adaptive networks (MLAN) to incorporate out-of-domain data when training large vocabulary speech recognition systems. In a set of experiments on multi-genre broadcast data and on TED lecture recordings, we present results using out-of-domain features in a hybrid DNN system and explore tandem systems using a variety of input acoustic features. Our experiments indicate that using the MLAN approach in both hybrid and tandem systems results in consistent reductions …
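To make the tandem/MLAN idea in the abstract concrete, here is a minimal, illustrative sketch: a network trained on out-of-domain data produces features that are concatenated with the in-domain acoustics and passed through a second, in-domain network, whose outputs then serve as tandem features. The dimensions, network shapes, and random (untrained) weights below are assumptions for illustration only, not the paper's actual configuration.

```python
# Minimal sketch of the tandem / MLAN feature idea (toy dimensions,
# generic feed-forward nets; training is omitted).
import numpy as np

rng = np.random.default_rng(0)

def make_net(dims):
    """Randomly initialised (W, b) layers; stands in for a trained DNN."""
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(net, x):
    """Feed-forward pass with tanh hidden units; the last layer is left
    linear so its activations can be used as features."""
    h = x
    for W, b in net[:-1]:
        h = np.tanh(h @ W + b)
    W, b = net[-1]
    return h @ W + b

# Hypothetical 39-dim acoustic frames and a 26-dim feature layer.
n_frames, acoustic_dim, feat_dim = 100, 39, 26
frames = rng.standard_normal((n_frames, acoustic_dim))

# 1) A first network, trained on out-of-domain data, produces features.
ood_net = make_net([acoustic_dim, 512, feat_dim])
ood_feats = forward(ood_net, frames)

# 2) MLAN: concatenate those features with the in-domain acoustics and
#    feed them to a second, in-domain network.
mlan_input = np.concatenate([frames, ood_feats], axis=1)
indomain_net = make_net([acoustic_dim + feat_dim, 512, feat_dim])
tandem_feats = forward(indomain_net, mlan_input)

# 3) Tandem use: these features would be appended to the acoustics and
#    modelled by a GMM-HMM; in a hybrid system the second network would
#    instead output HMM state posteriors.
print(tandem_feats.shape)  # (100, 26)
```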

Cited by 36 publications (28 citation statements)
References 31 publications (42 reference statements)
“…Encouraged by the success of DNNs in the hybrid approach, researchers reevaluated the tandem approach using DNNs and achieved similar performance improvements [3,14-20]. Some comparative studies were conducted for the hybrid and tandem approaches, though no evidence supports that one approach clearly outperforms the other [21,22].…”
Section: Introduction
confidence: 99%
“…Speech recognition was performed using a system [1] trained primarily over TED talks as used for the IWSLT 2012 ASR evaluation. The system has two passes of decoding, both using hybrid models in which HMM observation probabilities are computed using a deep neural network.…”
Section: Audio Processing and Speech Recognition
confidence: 99%
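For readers unfamiliar with the hybrid setup described in the excerpt above, the sketch below shows the standard scaled-likelihood conversion of DNN state posteriors into HMM observation scores, p(x|s) ∝ p(s|x) / p(s). The posteriors, priors, and dimensions are synthetic placeholders, not values from the cited system.

```python
# Minimal sketch: turning DNN state posteriors into HMM observation
# scores via the scaled-likelihood trick used in hybrid systems.
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_states = 5, 10

# Pretend DNN output: per-frame posteriors over tied HMM states.
logits = rng.standard_normal((n_frames, n_states))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# State priors, normally estimated from the training alignments;
# uniform here purely for illustration.
priors = np.full(n_states, 1.0 / n_states)

# Scaled log-likelihoods that would be handed to the HMM decoder.
log_obs = np.log(posteriors) - np.log(priors)
print(log_obs.shape)  # (5, 10)
```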
“…However, it was only found recently that an MLP with a large set of context-dependent targets and many hidden layers, i.e., a context-dependent deep neural network (CD-DNN), could significantly improve recognition performance [3-5]. Although CD-DNNs have demonstrated favourable performance in various speech recognition tasks [4-9], an existing well-trained traditional GMM-HMM has to be used for two main aspects of training: state-to-frame alignments and defining a set of tied context-dependent states [3,4].…”
Section: Introduction
confidence: 99%
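The role of the GMM-HMM mentioned in this excerpt can be made concrete with a small sketch: a forced alignment assigns each frame a tied-state label, and those labels become the frame-level targets for cross-entropy training of the CD-DNN. The alignment and state inventory below are invented for illustration.

```python
# Minimal sketch: frame-level CD-DNN targets derived from a GMM-HMM
# forced alignment (the alignment here is made up).
import numpy as np

n_tied_states = 6
# Hypothetical state-to-frame alignment from an existing GMM-HMM.
alignment = np.array([0, 0, 3, 3, 3, 5, 5, 2, 2, 2])

# One-hot targets for frame-wise cross-entropy training of the CD-DNN.
targets = np.eye(n_tied_states)[alignment]
print(targets.shape)  # (10, 6)
```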