2021
DOI: 10.1007/978-3-030-87802-3_54
An Equal Data Setting for Attention-Based Encoder-Decoder and HMM/DNN Models: A Case Study in Finnish ASR

Abstract: Standard end-to-end training of attention-based ASR models only uses transcribed speech. If they are compared to HMM/DNN systems, which additionally leverage a large corpus of text-only data and expert-crafted lexica, the differences in modeling cannot be disentangled from differences in data. We propose an experimental setup, where only transcribed speech is used to train both model types. To highlight the difference that text-only data can make, we use Finnish, where an expert-crafted lexicon is not needed. …

Cited by 6 publications (12 citation statements)
References 25 publications (37 reference statements)
“…We compared the AED models against HMM systems in an equal data setting (Rouhe et al., 2021): both paradigms only used transcribed speech as the training data. Since HMM systems typically leverage additional text data and expert lexica, comparing them with end-to-end models only trained on transcribed speech confounds differences in models and learning with differences in the training data.…”
Section: Models
confidence: 99%
“…For improved speech recognition accuracy, DNN-HMM methods are best suited for languages with limited annotated speech [14]. Also, when much more text data is available than speech data, DNN-HMM models are preferred over the modern E2E approaches [15, 16]. Additionally, DNN-HMM ASR models offer the advantage of easy integration into small hardware devices, enabling fast on-device speech recognition [14].…”
Section: Baby Elephant Compound Word Formed By Agglutination Of Nouns...
confidence: 99%
“…This has two benefits. Firstly, the end-to-end AED models and the HMM/DNN system can then be compared in an equal data setting [14]. Pure end-to-end models are only trained on transcribed speech.…”
Section: Language Modeling Data
confidence: 99%
“…We use grapheme-based lexica and SentencePiece Byte Pair Encoding subword units [21] in language modeling. The lexicon transducer requires special word-position dependent phone handling [14,22]. The language models are built with VariKN [23], which allows growing the modified Kneser-Ney models up to 10-gram scale.…”
Section: HMM System Language Models
confidence: 99%
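The statement above refers to SentencePiece Byte Pair Encoding subword units [21]. As a rough illustration of how BPE merges are learned, here is a minimal pure-Python toy (not the paper's actual SentencePiece setup; the Finnish example words and the `_` end-of-word marker are illustrative assumptions standing in for SentencePiece's `▁` convention):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a (symbol-tuple -> frequency) corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge operations from a list of words."""
    # Start from single characters; '_' marks the word boundary.
    words = Counter(tuple(w) + ('_',) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges

# Toy Finnish-like corpus: inflected forms share the stem "talo",
# so the first merges recover stem-internal units.
merges = learn_bpe(["talo", "talossa", "taloja"], 3)
```

The learned merges grow longer subwords greedily from the most frequent adjacent pair, which is why shared stems of agglutinative word forms emerge as units; a real setup would train on the full transcript corpus with a fixed vocabulary size via SentencePiece.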