Interspeech 2022
DOI: 10.21437/interspeech.2022-11449

Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion

Abstract: Multilingual speech recognition has drawn significant attention as an effective way to compensate for data scarcity in low-resource languages. End-to-end (e2e) modelling is preferred over conventional hybrid systems, mainly because it requires no lexicon. However, hybrid DNN-HMMs still outperform e2e models in limited-data scenarios. Furthermore, the problem of manual lexicon creation has been alleviated by publicly available trained models of grapheme-to-phoneme (G2P) and text to IPA transliteration for a lot …

Cited by 2 publications (4 citation statements)
References 31 publications
“…So, the mapping model's accuracy is calculated for different values of n, where n represents the number of most probable classes. Though the accuracy increases with increasing n, the rate of change is not as large as that observed for phonemes by [24], which implies that the mapping model has performed better for phoneme-based hybrid DNN-HMM systems than for e2e systems. Since the mapping models are trained using posterior distributions of ASR outputs, one potential reason could be the detrimental effect of speech recognition systems on the training of mapping models.…”
Section: Mapping Models (mentioning)
confidence: 84%
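
The top-n accuracy mentioned in this statement can be illustrated with a minimal sketch. It assumes the mapping model emits one posterior vector per frame and that a frame counts as correct when its true class is among the n most probable classes. The function name `top_n_accuracy` and the toy posteriors are hypothetical and are not taken from the cited papers.

```python
def top_n_accuracy(posteriors, targets, n):
    """Fraction of frames whose true class is among the n most probable classes."""
    hits = 0
    for probs, target in zip(posteriors, targets):
        # Indices of the n highest-probability classes for this frame.
        top_n = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:n]
        hits += target in top_n
    return hits / len(targets)


# Hypothetical toy data: posteriors over 4 classes for 3 frames.
posteriors = [
    [0.10, 0.60, 0.20, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.05, 0.30, 0.40],
]
targets = [1, 1, 0]

# Accuracy can only stay the same or rise as n grows.
accuracy_by_n = {n: top_n_accuracy(posteriors, targets, n) for n in (1, 2, 3)}
```

How quickly `accuracy_by_n` rises with n is the "rate of change" the statement compares between phoneme-based and e2e systems.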
“…Mapping models in the previous work [24] have been trained at the frame level without considering contextual information, but connected speech is a continuous signal that exhibits co-articulation and temporal smearing. Furthermore, a separate model has been trained for each source-target language pair, giving rise to a requirement of N(N − 1) mapping models.…”
Section: Mapping Models (mentioning)
confidence: 99%
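
The N(N − 1) count follows from training one model per ordered source-target pair, so that A→B and B→A are separate models. A small sketch, using hypothetical placeholder language names, enumerates those pairs:

```python
from itertools import permutations


def ordered_language_pairs(languages):
    """Return every (source, target) pair with source != target."""
    return list(permutations(languages, 2))


# Hypothetical placeholder names; any N distinct languages behave the same way.
languages = ["lang_a", "lang_b", "lang_c", "lang_d"]
pairs = ordered_language_pairs(languages)

# One mapping model per ordered pair: N * (N - 1) models in total.
num_models = len(pairs)
```

With four languages this already gives 12 pairwise models, which shows how quickly the requirement grows as languages are added.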