KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos

Lakomkin, Egor; Magg, Sven; Weber, Cornelius; Wermter, Stefan

doi:10.18653/v1/d18-2016

Cited by 8 publications

(6 citation statements)

References 10 publications

(12 reference statements)


“…Voice is another trait in which samples are easily found in internet videos; in fact, these videos might better simulate the system's deployments than the controlled environments usually used in previous works. Lakomkin, Magg, Weber, and Wermter (2018) describe how to crawl and collect a dataset for speech recognition, a process that could be adapted for collecting voice recognition data. A similar argument can be made to stylometry-based continuous authentication (Narayanan et al, 2012).…”

Section: Mining Massive Datasets (mentioning)

confidence: 99%

Continuous authentication using biometrics: An advanced review

Dahia, Jesus, Segundo

2020

WIREs Data Min & Knowl


The shortcomings of conventional access control systems for high‐security environments have led to the concept of continuous authentication. Contrary to traditional verification, in which users are authenticated only once at the start of their session, continuous authentication systems regularly check users' identities to prevent hijackings. The challenges in this area involve balancing the security of protected assets by quickly detecting intruders with the system usability for genuine users. Biometric recognition plays a major role within this context, as it is the main way to assure that users are who they claim to be. A comparative analysis of the latest works revealed different aspects of this problem. First, some biometric traits among those applied for continuous authentication are more suitable for this task than others. Second, systems combining multiple traits have advantages over those relying on a single one. Finally, many works fail to report proper evaluation metrics. With this in mind, we were able to identify new opportunities for researchers in the field. We highlight the potential for mining new datasets on the internet, which would benefit validation and benchmarking, and how recent deep learning techniques could address some of the open challenges in the area. This article is categorized under: Technologies > Prediction; Technologies > Machine Learning; Application Areas > Science and Technology


“…As a further test, we trained an ASR on ~7000 hours of curated YouTube utterances and saw better WER performance than Google's default ASR model. We have found similar approaches in utilizing data from YouTube [12]. Our approach is more general in that it can search for and extract…”

Section: Data Curation Pipeline (mentioning)

confidence: 83%

“…TED talks) and how-to videos to train neural nets to pick out a single speaker from a noisy environment (cocktail party effect). Reference [12] introduces a crawler for YouTube to curate training dataset for ASR and demonstrates a 40% improvement in Word Error Rate (WER) on the Wall Street Journal test dataset. In [13], the authors address the problem of operating ASRs in a wide range of developing languages, such as Swahili, by proposing to automatically scrape audio from YouTube and Voice of America and use ASR system confidence scores as the primary metric for the model components.…”

Section: Related Work (mentioning)

confidence: 99%

Automated Techniques for Creating Speech Corpora from Public Data Sources for ML Training

Drabeck, Ramanan, Woo et al.

2020

IJMLC


For machine learning (ML) to work well, there is a need for large amounts of good quality training data. Obtaining such data is often the key bottleneck for the entire ML development process. Using humans to do explicit collection has been the main approach, but this tends to be expensive and time-consuming. Therefore, there is significant interest in creating alternative data collection techniques. We explore these alternative data collection techniques in the context of speech data in this paper. We were initially motivated by the problem of wake word engine training, where we need a large number of utterances for specific wake words. Given that there are already large public repositories of media data (e.g., YouTube, DailyMotion), we were curious as to how feasible it is to find the utterances that we need. Our results are encouraging as we found many different types of words can readily be found and downloaded in the quantity and quality needed to create training corpora for DL training. Usually > 30% of the found words are suitable for corpus creation. Greater than 80% of the top 10,000 ranked words and > 50% of the top 20,000 words we selected easily produced > 5000 found words, which is sufficient to train a high quality Wake Word Engine. Besides general words, we specifically looked for words used in wake word engine construction such as Name/Place/Product Name. Here, again, we find most common names/places/products return more than a sufficient number of words for corpus creation. Only uncommon names and places (like Atticus or Maximus) are difficult to find in sufficient quantities for corpus creation. We demonstrate that a wake word engine trained on words we found on YouTube performs equivalently to one trained on traditionally human-collected words. Even though we were focused on wake words, our approach is general. It can be applied to create speech corpora for various purposes.


“…The work [26] introduces the "island of confidence" filtering heuristic to extract useful speech segments with transcripts from Youtube videos. Lakomkin et al [27] propose a set of filtering rules to construct speech datasets from Youtube videos and auto-sync captions. These methods generally require a well-performed ASR model to start-up.…”

Section: Related Work (mentioning)

confidence: 99%

Weakly Supervised Construction of ASR Systems with Massive Video Data

Cheng, Wang, Huang et al.

2020

Preprint


Building Automatic Speech Recognition (ASR) systems from scratch is significantly challenging, mostly due to the time-consuming and financially expensive process of annotating a large amount of audio data with transcripts. Although several unsupervised pre-training models have been proposed, applying such models directly might still be sub-optimal if more labeled training data could be obtained without a large cost. In this paper, we present a weakly supervised framework for constructing ASR systems with massive video data. As videos often contain human-speech audios aligned with subtitles, we consider videos as an important knowledge source, and propose an effective approach to extract high-quality audios aligned with transcripts from videos based on Optical Character Recognition (OCR). The underlying ASR model can be fine-tuned to fit any domain-specific target training datasets after weakly supervised pre-training. Extensive experiments show that our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
