Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021
DOI: 10.18653/v1/2021.findings-acl.447

Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights

Abstract: Automatic speech recognition (ASR) in Sanskrit is interesting, owing to the various linguistic peculiarities present in the language. The Sanskrit language is lexically productive, undergoes euphonic assimilation of phones at the word boundaries and exhibits variations in spelling conventions and in pronunciations. In this work, we propose the first large scale study of automatic speech recognition (ASR) in Sanskrit, with an emphasis on the impact of unit selection in Sanskrit ASR. In this work, we release a 7…

Cited by 11 publications (5 citation statements)
References 16 publications (13 reference statements)
“…Subword token-based language modeling has been proposed for applications in speech recognition [7,11,12,22,23], statistical machine translation [24], neural machine translations [20,21] and handwriting recognition [25]. The choice of subword tokens used in language modeling impacts the performance of the model on many downstream tasks [26] including speech recognition [7].…”
Section: Subword Tokenization Algorithms (mentioning)
confidence: 99%
“…Orthographic syllable-based tokenization of text was proposed by Kunchukuttan et al for statistical machine translation applications [24]. Splitting the tokens based on vowels and adjacent consonants, named vowel segmentation, was proposed by Adiga et al and employed in the context of Sanskrit speech recognition [22]. These two methods segment text into syllable-like units at valid pronunciation boundaries.…”
Section: Syllable Tokens (mentioning)
confidence: 99%
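The idea in the statement above, splitting text into syllable-like units at vowel boundaries, can be illustrated with a minimal sketch. This is not the vowel segmentation of Adiga et al. nor the orthographic syllable scheme of Kunchukuttan et al.; it is a simplified approximation that assumes lowercase ASCII transliteration with a five-vowel inventory, and the names `VOWELS` and `segment` are hypothetical.

```python
import re

# Assumption: a simplified ASCII vowel inventory, for illustration only.
# Real Sanskrit transliteration schemes have long vowels, diphthongs, etc.
VOWELS = "aeiou"

# One unit = zero or more non-vowel characters followed by one or more vowels.
_UNIT = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+")


def segment(word: str) -> list[str]:
    """Split a single word into syllable-like units that each end in a vowel run.

    Trailing consonants with no following vowel are attached to the last unit.
    A word with no vowels is returned as one unit.
    """
    word = word.lower()
    units = _UNIT.findall(word)
    if not units:
        return [word] if word else []
    # Matches are contiguous from the start, so any leftover is a final
    # consonant cluster; attach it to the last unit.
    consumed = sum(len(u) for u in units)
    tail = word[consumed:]
    if tail:
        units[-1] += tail
    return units


if __name__ == "__main__":
    for w in ["sanskrita", "vak", "krt"]:
        print(w, segment(w))
```

Concatenating the units always reproduces the lowercased input, so the segmentation is lossless; a subword vocabulary for a language model could then be built from the resulting units.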
“…Most of these datasets such as MUCS (Diwan et al 2021), MSR (Srivastava et al 2018), Gramvaani (Bhanushali et al 2022), Crowdsourced Speech Corpora (CSC) (Kjartansson et al 2018a), IISC-MILE Corpus (Ayyavu, Pilar, and G 2022), Crowdsourced Multispeaker Speech Dataset (CSD) (He et al 2020), Kashmiri Data Corpus (KDC), Common Voice, IIIT-H Indic Speech Databases (ISD) (Prahallad et al 2012), Hindi-Tamil ASR Challenge (HTC), Vāksañcayaḥ (VAC) (Adiga et al 2021), IIIT-H Telugu Corpus (Ganesh et al 2021) (ITH), IITB Marathi Corpus (IMC) (Abraham et al 2020a) and SMC Malayalam Corpus contain only ASR data. Further, most of them support very few languages.…”
Section: Related Work (mentioning)
confidence: 99%
“…It has been demonstrated in literature that subword based models are better in capturing language features for morphologically complex languages [18]. Syllables serve as a good choice of subwords for practical applications including automatic speech recognition [67] that takes care of OOV scenarios. Orthographic syllable units have proven to be more effective in statistical machine translation, than other basic units (word, morpheme and character) when trained over small parallel corpora [17].…”
Section: A Syllable Based Language Modeling (mentioning)
confidence: 99%