2020
DOI: 10.48550/arxiv.2010.06778
Preprint

Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview

Cited by 2 publications
(3 citation statements)
References 0 publications
“…The NCHLT project expands the efforts of the works in Roux et al (2004) and Badenhorst et al (2011) to enable the development of large-vocabulary speech recognition systems for practical applications. There are other existing databases, such as the Wall Street Journal (Garofolo, Graff, Paul, & Pallett, 1993), GlobalPhone (Schultz, 2002) and Google (Butryna et al, 2020) corpora. However, these only contain data for the English language.…”
Section: Results | Citation type: mentioning
confidence: 99%
“…The dataset will be used from (Butryna et al, 2020), which contains compilations of speech audio with .wav extension, file ID, and transcription of each audio file. There is a total of 5822 audio files with their ID and transcription, with an average duration of each audio being 3 seconds, and the total duration of all audio files is plus minus 4.85 hours.…”
Section: Data Collection | Citation type: mentioning
confidence: 99%
“…Low-resource language is a language that has little data in the form of digital, or that can be processed by a computer directly. One example of this language category is the Javanese language (Butryna et al, 2020). Whisper performance evaluated using Character Error Rate (CER)/Word Error Rate (WER) on this type of language is remarkably low.…”
Section: Introduction | Citation type: mentioning
confidence: 99%