Interspeech 2016
DOI: 10.21437/interspeech.2016-462

Selection of Multi-Genre Broadcast Data for the Training of Automatic Speech Recognition Systems

Abstract: This paper compares schemes for the selection of multi-genre broadcast data and corresponding transcriptions for speech recognition model training. Selections of the same amount of data (700 hours) from lightly supervised alignments based on the same original subtitle transcripts are compared. Data segments were selected according to a maximum phone matched error rate between the lightly supervised decoding and the original transcript. The data selected with an improved lightly supervised system yields lower w…
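The selection criterion in the abstract can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical example (the function names, data layout, and the 0.4 threshold echoed from the citing papers are assumptions, not the authors' released tooling): it computes a phone-level edit distance between the lightly supervised decoding and the original subtitle transcript of each segment, and keeps only segments whose phone matched error rate (PMER) falls below a maximum value.

```python
# Minimal sketch of PMER-based segment selection (hypothetical helper names;
# not the authors' released code).

def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution / match
            prev, d[j] = d[j], cur
    return d[len(hyp)]

def phone_matched_error_rate(subtitle_phones, decoded_phones):
    """PMER: phone edit distance between the lightly supervised decoding and
    the original subtitle transcript, normalised by the subtitle length."""
    if not subtitle_phones:
        return 1.0
    return edit_distance(subtitle_phones, decoded_phones) / len(subtitle_phones)

def select_segments(segments, max_pmer=0.4):
    """Keep segments below the PMER threshold (0.4 mirrors the 40% cut-off
    quoted by the citing papers; purely illustrative here)."""
    return [seg for seg in segments
            if phone_matched_error_rate(seg["subtitle_phones"],
                                        seg["decoded_phones"]) < max_pmer]

# Toy usage with made-up phone sequences:
segments = [
    {"id": "prog1_0001",
     "subtitle_phones": ["dh", "ax", "k", "ae", "t"],
     "decoded_phones":  ["dh", "ax", "k", "ae", "t"]},   # PMER 0.0 -> kept
    {"id": "prog1_0002",
     "subtitle_phones": ["hh", "eh", "l", "ow"],
     "decoded_phones":  ["g", "uh", "d", "b", "ay"]},    # PMER > 0.4 -> dropped
]
print([seg["id"] for seg in select_segments(segments)])
```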

Cited by 18 publications (14 citation statements)
References 15 publications (17 reference statements)

“…The audio is from BBC TV programmes covering a range of genres. A 275 hour (275h) full training set was selected from 750 episodes where the sub-titles have a phone matched error rate < 40% compared to the lightly supervised output [35] which was used as training supervision. A 55 hour (55h) subset was sampled at the utterance level from the 275h set.…”
Section: Methods (mentioning)
confidence: 99%
“…A total of 375 hours of audio data with associated subtitles is available for acoustic model training. Lightly supervised decoding and selection was used to extract 275 hours for training [34,35,8]. The reference segmentation was used with automatic speaker clustering resulting in 192,209 utterances and 13,467 speaker clusters.…”
Section: Methods (mentioning)
confidence: 99%
“…The 2017 Multi-Genre Broadcast (MGB-3) English task [34] comprises audio recordings from television programs of a variety of genres. Lightly supervised decoding and selection [35] was used to extract a training set with 275 hours of data, out of the full 375 hours of available audio data. The 5.5 hours dev17b test set was used, and was divided into segments using a DNNbased segmenter [36] that was trained on the MGB-3 data.…”
Section: Sequence Posterior Targets (mentioning)
confidence: 99%