2020
DOI: 10.3390/sym12020290

Building a Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition Results

Abstract: To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step required for developing an ASR system for a language with few argument speech documents available. Turkish is a language with limited resources for ASR. Therefore, development of a symmetric Turkish transcribed speech corpus according to the high resources languages corpora is crucial for improving and promoting Turkish speech recognition activitie…

Cited by 12 publications (7 citation statements)
References 29 publications
“…The dataset consists of English braille symbols, specifically English Grade 1 braille which is un-contracted, meaning there is a specific braille symbol for each letter in the English alphabet as well as for punctuation. Just as there are insufficient datasets for speech-to-text recognition [25], multiple online searches showed a lack of braille character datasets that fit the needs of this study. The few that do exist are either in foreign languages, have broken links, are not publicly available, or only contain printed images of braille, which are not suitable for training a recognition system that can learn imperfect and blurry images.…”
Section: Dataset and Setup (mentioning)
confidence: 96%
“…The Turkish speech data set [51] prepared by Bogazici University in 2012 and presented by LDC, the Linguistic Data Consortium, and METU 1.0 sound corpus provided by METU were used for training and testing processes of the ASR system [52]. Also, a new corpus was used to exhibit the performance of the method we propose more clearly (HS Corpus) [53]. The corpus information is given in Table 1.…”
Section: Preparation Of the Corpus (mentioning)
confidence: 99%
“…Arısoy et al (2009) report a larger dataset of broadcast news, and a dataset of 38 000 hours of call center recordings is reported by Haznedaroğlu and Arslan (2014). A recent speech corpus, consisting of movies with aligned subtitles, and read speech samples are reported by Polat and Oyucu (2020). The availability of corpora listed above is unclear.…”
Section: Speech and Multi-modal Corpora (mentioning)
confidence: 99%