Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations 2018
DOI: 10.18653/v1/d18-2016

KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos

Abstract: In this paper, we describe KT-Speech-Crawler: an approach for automatic dataset construction for speech recognition by crawling YouTube videos. We outline several filtering and postprocessing steps, which extract samples that can be used for training end-to-end neural speech recognition systems. In our experiments, we demonstrate that a single-core version of the crawler can obtain around 150 hours of transcribed speech within a day, containing an estimated 3.5% word error rate in the transcriptions. Automatica…
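
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions; this is not the procedure the authors used to estimate the 3.5% figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.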

Cited by 8 publications (6 citation statements)
References 10 publications (12 reference statements)
“…Voice is another trait in which samples are easily found in internet videos; in fact, these videos might better simulate the system's deployments than the controlled environments usually used in previous works. Lakomkin, Magg, Weber, and Wermter (2018) describe how to crawl and collect a dataset for speech recognition, a process that could be adapted for collecting voice recognition data. A similar argument can be made to stylometry-based continuous authentication (Narayanan et al, 2012).…”
Section: Mining Massive Datasets (mentioning)
confidence: 99%
“…As a further test, we trained an ASR on ~7000 hours of curated YouTube utterances and saw better WER performance than Google's default ASR model. We have found similar approaches in utilizing data from YouTube [12]. Our approach is more general in that it can search for and extract…”
Section: Data Curation Pipeline (mentioning)
confidence: 83%
“…TED talks) and how-to videos to train neural nets to pick out a single speaker from a noisy environment (cocktail party effect). Reference [12] introduces a crawler for YouTube to curate training dataset for ASR and demonstrates a 40% improvement in Word Error Rate (WER) on the Wall Street Journal test dataset. In [13], the authors address the problem of operating ASRs in a wide range of developing languages, such as Swahili, by proposing to automatically scrape audio from YouTube and Voice of America and use ASR system confidence scores as the primary metric for the model components.…”
Section: Related Work (mentioning)
confidence: 99%
“…The work [26] introduces the "island of confidence" filtering heuristic to extract useful speech segments with transcripts from Youtube videos. Lakomkin et al [27] propose a set of filtering rules to construct speech datasets from Youtube videos and auto-sync captions. These methods generally require a well-performed ASR model to start-up.…”
Section: Related Work (mentioning)
confidence: 99%