2020
DOI: 10.48550/arxiv.2009.10334
Preprint

A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Abstract: We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 335 hours of transcribed audio comprising over 154,000 utterances spoken by participants from different regions, age groups, and genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and …

Cited by 5 publications (12 citation statements)
References 11 publications
“…The Kazakh speech data set uses KSC [34], which contains about 330 h of Kazakh speech data. In this paper, speech data with different time length settings are randomly selected as fine-tuning data, and the verification and test sets of the divided standards are used.…”
Section: Methods (mentioning)
confidence: 99%
“…However, this 300 h of the Kazakh speech corpus is not publicly available and is small for training robust E2E ASR models. To solve these limitations, Khassanov et al. [46] created an open-source speech corpus for the Kazakh language that contains approximately 332 h of transcribed audio and over 153,000 utterances uttered by people of all ages and genders, and several geographical locations. They began by extracting Kazakh textual data from a variety of sources, including legislation, electronic publications, and websites, such as Wikipedia and blogs.…”
Section: Related Work (mentioning)
confidence: 99%
“…In 2020, researchers at Nazarbayev University in Kazakhstan published a 330 h Kazakh speech corpus (KSC) [114], which was the largest Kazakh database at that time. Motivated by Common Voice, this database was also collected by crowdsourcing.…”
Section: KSC/KSC2 (mentioning)
confidence: 99%
“…However, they did not publish the platform, nor the data. In 2021 also, researchers from Nazarbayev University released the first large-scale Kazakh database, KSC [114], offering the first open benchmark for Kazakh speech recognition research. Since then, research on Kazakh has been in the fast lane.…”
Section: KSC/KSC2 (mentioning)
confidence: 99%