2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
DOI: 10.1109/asru.2013.6707741
Semi-supervised training of Deep Neural Networks

Cited by 253 publications (299 citation statements)
References 21 publications
“…The logarithm diverges if the argument goes to zero, i.e., if the correct word sequence has zero probability in decoding. To avoid numerical issues with such utterances, we use the frame rejection heuristic described in [13], i.e., discard frames whose denominator state occupancy is close to zero, $\gamma^{(\mathrm{den})}_{ut}(s) < \varepsilon$ (here, $\varepsilon = 0.001$). No regularization (for example, $\ell_2$-regularization around the initial network) or smoothing such as the H-criterion [12] is used in this paper, as there is no empirical evidence for overfitting.…”
Section: Deep Neural Network in ASR
confidence: 99%
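
The frame rejection heuristic quoted above amounts to masking out frames whose reference state has near-zero occupancy in the denominator lattice. A minimal NumPy sketch follows; the array names and layout (a per-utterance occupancy matrix gamma_den and per-frame numerator state indices num_states) are illustrative assumptions, not the cited implementation.

import numpy as np

def frame_rejection_mask(gamma_den, num_states, eps=0.001):
    # gamma_den:  (T, S) denominator-lattice state occupancies gamma^(den)_ut(s)
    #             for one utterance (assumed layout).
    # num_states: (T,) numerator (reference) state index for each frame.
    # eps:        rejection threshold (0.001 in the quoted excerpt).
    T = gamma_den.shape[0]
    # Occupancy of the reference state in the denominator lattice, per frame.
    ref_occ = gamma_den[np.arange(T), num_states]
    # Keep only frames whose reference-state occupancy is not close to zero.
    return ref_occ >= eps
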
“…Directly minimizing the word error is a hard optimization problem and thus, several surrogates have been proposed, including maximum mutual information (MMI) [8], minimum phone error (MPE) [9] or state-level minimum Bayes risk (sMBR) [10]. Good gains have recently been reported for sequence training of DNNs [10,11,12,13].…”
Section: Introduction
confidence: 99%
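
For context on the surrogate criteria listed in this excerpt, the MMI objective is commonly written as below; this is the standard formulation (with acoustic scale $\kappa$), not a formula taken from the cited paper, and MPE/sMBR replace the log-posterior of the reference with an expected phone- or state-level accuracy.

\mathcal{F}_{\mathrm{MMI}}(\theta) \;=\; \sum_{u} \log
  \frac{p_{\theta}(O_u \mid S_{W_u})^{\kappa}\, P(W_u)}
       {\sum_{W} p_{\theta}(O_u \mid S_{W})^{\kappa}\, P(W)}

where $O_u$ are the acoustic observations of utterance $u$, $W_u$ its reference transcript, $S_W$ the HMM state sequence of hypothesis $W$, and $P(W)$ the language model probability; in practice the denominator sum runs over a lattice of competing hypotheses.
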
“…Our experience suggests that around 2% absolute WER improvement is gained by expanding the training set to 700 hours by increasing the MER threshold to 40%. We also show the results from applying a standard DNN training recipe with CE training followed by sMBR sequence training [20]. Two iterations of CE training are used, with state alignments regenerated after the first iteration.…”
Section: Baseline System
confidence: 99%
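
The recipe summarized in this excerpt (two cross-entropy passes with a realignment step in between, then sMBR sequence training) can be outlined roughly as below. The helpers train_ce, realign and train_smbr are hypothetical placeholders standing in for whatever toolkit is used; this is a sketch of the training order, not an actual API.

# Rough outline of the CE -> realign -> CE -> sMBR pipeline (hypothetical helpers).
def train_dnn_recipe(features, transcripts, init_alignments):
    # First cross-entropy pass on the initial (e.g., GMM-derived) state alignments.
    dnn = train_ce(features, init_alignments)
    # Regenerate state alignments with the CE-trained DNN, then run a second CE pass.
    alignments = realign(dnn, features, transcripts)
    dnn = train_ce(features, alignments, init_model=dnn)
    # Finish with sequence-discriminative training under the sMBR criterion.
    dnn = train_smbr(dnn, features, transcripts, alignments)
    return dnn
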
“…The deep neural network (DNN) has 7 layers, each with 2048 units, and it employs 5 frames of left and right context around the current input frame (i.e., 11 × 40 = 440 input units). It is trained using standard restricted Boltzmann machine pre-training, cross-entropy training and sequence-discriminative training using the state-level minimum Bayes risk criterion (Veselý et al., 2013).…”
Section: Speech Recognition
confidence: 99%
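
The network shape described in this excerpt (440 inputs from 11 stacked 40-dimensional frames, 7 layers of 2048 units, read here as 7 hidden layers) can be sketched in PyTorch as below. The sigmoid activations and the output dimension NUM_PDFS (number of tied HMM states) are assumptions for illustration; the excerpt does not specify them.

import torch.nn as nn

# Sketch of the described DNN: 440 inputs, 7 hidden layers of 2048 units,
# and a hypothetical log-softmax output over NUM_PDFS tied HMM states.
NUM_PDFS = 4000  # hypothetical; the excerpt does not give the output size

layers = []
in_dim = 11 * 40  # 5 left + 5 right + current frame, 40 features each = 440
for _ in range(7):
    layers += [nn.Linear(in_dim, 2048), nn.Sigmoid()]
    in_dim = 2048
layers += [nn.Linear(in_dim, NUM_PDFS), nn.LogSoftmax(dim=-1)]

dnn = nn.Sequential(*layers)
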