Interspeech 2016
DOI: 10.21437/interspeech.2016-462

Selection of Multi-Genre Broadcast Data for the Training of Automatic Speech Recognition Systems

Abstract: This paper compares schemes for the selection of multi-genre broadcast data and corresponding transcriptions for speech recognition model training. Selections of the same amount of data (700 hours) from lightly supervised alignments based on the same original subtitle transcripts are compared. Data segments were selected according to a maximum phone matched error rate between the lightly supervised decoding and the original transcript. The data selected with an improved lightly supervised system yields lower w…
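The selection criterion in the abstract can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical example (the function names, data layout, and the 0.4 threshold echoed from the citing papers are assumptions, not the authors' released tooling): it computes a phone-level edit distance between the lightly supervised decoding and the original subtitle transcript of each segment, and keeps only segments whose phone matched error rate (PMER) falls below a maximum value.

```python
# Minimal sketch of PMER-based segment selection (hypothetical helper names;
# not the authors' released code).

def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution / match
            prev, d[j] = d[j], cur
    return d[len(hyp)]

def phone_matched_error_rate(subtitle_phones, decoded_phones):
    """PMER: phone edit distance between the lightly supervised decoding and
    the original subtitle transcript, normalised by the subtitle length."""
    if not subtitle_phones:
        return 1.0
    return edit_distance(subtitle_phones, decoded_phones) / len(subtitle_phones)

def select_segments(segments, max_pmer=0.4):
    """Keep segments below the PMER threshold (0.4 mirrors the 40% cut-off
    quoted by the citing papers; purely illustrative here)."""
    return [seg for seg in segments
            if phone_matched_error_rate(seg["subtitle_phones"],
                                        seg["decoded_phones"]) < max_pmer]

# Toy usage with made-up phone sequences:
segments = [
    {"id": "prog1_0001",
     "subtitle_phones": ["dh", "ax", "k", "ae", "t"],
     "decoded_phones":  ["dh", "ax", "k", "ae", "t"]},   # PMER 0.0 -> kept
    {"id": "prog1_0002",
     "subtitle_phones": ["hh", "eh", "l", "ow"],
     "decoded_phones":  ["g", "uh", "d", "b", "ay"]},    # PMER > 0.4 -> dropped
]
print([seg["id"] for seg in select_segments(segments)])
```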

Cited by 18 publications (14 citation statements)
References 15 publications (17 reference statements)

“…The audio is from BBC TV programmes covering a range of genres. A 275 hour (275h) full training set was selected from 750 episodes where the sub-titles have a phone matched error rate < 40% compared to the lightly supervised output [35] which was used as training supervision. A 55 hour (55h) subset was sampled at the utterance level from the 275h set.…”
Section: Methods (mentioning)
confidence: 99%
“…A total of 375 hours of audio data with associated subtitles is available for acoustic model training. Lightly supervised decoding and selection was used to extract 275 hours for training [34,35,8]. The reference segmentation was used with automatic speaker clustering resulting in 192,209 utterances and 13,467 speaker clusters.…”
Section: Methods (mentioning)
confidence: 99%
“…The 2017 Multi-Genre Broadcast (MGB-3) English task [34] comprises audio recordings from television programs of a variety of genres. Lightly supervised decoding and selection [35] was used to extract a training set with 275 hours of data, out of the full 375 hours of available audio data. The 5.5 hours dev17b test set was used, and was divided into segments using a DNNbased segmenter [36] that was trained on the MGB-3 data.…”
Section: Sequence Posterior Targets (mentioning)
confidence: 99%