Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 2021
DOI: 10.18653/v1/2021.eacl-main.58

A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Abstract: We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data…

Cited by 26 publications (13 citation statements)
References: 21 publications

“…In [17], a 335 h corpus for the Kazakh language was presented. As a result of the experiment, they showed that a sufficiently large training data set significantly improves the performance of a speech recognition system based on an end-to-end model compared to hybrid ones.…”
Section: Literature Review and Problem Statement (mentioning)
confidence: 99%
“…Among the aforementioned three languages, Russian and English are considered resource-rich, i.e., a large number of annotated datasets exist [2,6,31] and extensive studies have been conducted, both in monolingual and multilingual settings [4,25,28]. On the other hand, Kazakh is considered a low-resource language, where annotated datasets and speech processing research have emerged only in recent years [19,26]. The authors of [19] presented the first crowdsourced open-source Kazakh speech corpus and conducted initial Kazakh speech recognition experiments on both DNN-HMM and E2E architectures.…”
Section: Related Work (mentioning)
confidence: 99%
“…On the other hand, Kazakh is considered a low-resource language, where annotated datasets and speech processing research have emerged only in recent years [19,26]. The authors of [19] presented the first crowdsourced open-source Kazakh speech corpus and conducted initial Kazakh speech recognition experiments on both DNN-HMM and E2E architectures. Similarly, the authors of [26] presented the first publicly available speech synthesis dataset for Kazakh.…”
Section: Related Work (mentioning)
confidence: 99%