2021
DOI: 10.48550/arxiv.2104.02387
Preprint

Towards Consistent Hybrid HMM Acoustic Modeling

Tina Raissi,
Eugen Beck,
Ralf Schlüter
et al.

Abstract: High-performance hybrid automatic speech recognition (ASR) systems are often trained with clustered triphone outputs, and thus require a complex training pipeline to generate the clustering. The same complex pipeline is often utilized in order to generate an alignment for use in frame-wise cross-entropy training. In this work, we propose a flat-start factored hybrid model trained by modeling the full set of triphone states explicitly without relying on clustering methods. This greatly simplifies the training o…

Cited by 3 publications (3 citation statements)
References 24 publications (29 reference statements)
“…This is achieved by a decomposition of the joint label identity of the center phoneme state and its left-right phonemes. With this previous work, and its extension [18], we have shown that it is possible to eliminate the state-tying for the determination of the state inventory, and obtain similar performance to a hybrid CART.…”
Section: Introduction
confidence: 84%
“…Also, available pronunciation lexica can be utilized indirectly for assisting subword generation for E2E systems [35], [36], which are shown to outperform byte-pair encoding. Within classical ASR systems, phonetic clustering also can be avoided completely by modeling phonemes in context directly [220].…”
Section: Relationship to Classical ASR
confidence: 99%
“…Also, available pronunciation lexica can be utilized indirectly for assisting subword generation for E2E systems [290], [291], which are shown to outperform byte-pair encoding. Within classical ASR systems, phonetic clustering also can be avoided completely by modeling phonemes in context directly [292].…”
Section: Use of Large-Scale Pretrained LMs
confidence: 99%