Interspeech 2021
DOI: 10.21437/interspeech.2021-1623
Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition

Cited by 9 publications (5 citation statements)
References 0 publications
“…Whole word models [30], according to Zipf's law [31], would require unrealistically high amounts of transcribed training data for large vocabularies, which might not be attainable for many tasks. On the other hand, methods to generate subword vocabularies based on characters, like the currently popular byte pair encoding (BPE) approach [32], might be seen as secondary approaches outside the E2E objective, even more so if acoustic data is considered for subword derivation [33], [34], [35], [36].…”
Section: Distinctiveness of the Term E2E (mentioning)
Confidence: 99%
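The byte pair encoding (BPE) approach [32] cited above grows a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair in a frequency-weighted word list. The following is a minimal sketch of that merge loop; the toy word counts and helper names are illustrative assumptions, not material from the cited papers.

```python
import re
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs over the space-separated vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(pair, vocab):
    """Fuse every standalone occurrence of the pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are space-separated characters plus an end-of-word marker;
# corpus frequencies decide which pairs get merged first.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                 # the merge budget sets the vocabulary size
    pair = most_frequent_pair(vocab)
    if pair is None:
        break
    vocab = merge_pair(pair, vocab)
print(vocab)                        # frequent fragments such as "est</w>" emerge
```

Note that the loop uses text statistics only; the statement's point is that such purely character-based derivation sits outside the E2E objective, which is what acoustic data-driven approaches like [33]-[36] address.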
“…via byte-pair encoding [32]. Also, available pronunciation lexica can be utilized indirectly for assisting subword generation for E2E systems [35], [36], which are shown to outperform byte-pair encoding. Within classical ASR systems, phonetic clustering also can be avoided completely by modeling phonemes in context directly [220].…”
Section: Relationship to Classical ASR (mentioning)
Confidence: 99%
“…via byte-pair encoding [27]. Also, available pronunciation lexica can be utilized indirectly for assisting subword generation for E2E systems [290], [291], which are shown to outperform byte-pair encoding. Within classical ASR systems, phonetic clustering also can be avoided completely by modeling phonemes in context directly [292].…”
Section: Use of Large-Scale Pretrained LMs (mentioning)
Confidence: 99%
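Both statements describe using a pronunciation lexicon indirectly to assist subword generation ([35], [36] above; [290], [291] here). One way to picture this, purely as a hypothetical illustration, is to gather the same merge statistics over phoneme sequences from the lexicon rather than over characters, so that candidate subwords align with pronunciation units. The lexicon entries and names below are toy assumptions, not taken from the cited papers.

```python
from collections import Counter

lexicon = {                        # word -> toy phoneme sequence
    "lowest": ["L", "OW", "AH", "S", "T"],
    "newest": ["N", "UW", "AH", "S", "T"],
}
word_freq = {"lowest": 4, "newest": 6}   # frequencies from a text corpus

# Represent each word by its phonemes with an end-of-word marker; a
# BPE-style merge loop (as sketched earlier) can then operate on these.
vocab = {" ".join(phones) + " </w>": word_freq[word]
         for word, phones in lexicon.items()}

pairs = Counter()
for word, freq in vocab.items():
    symbols = word.split()
    for pair in zip(symbols, symbols[1:]):
        pairs[pair] += freq
print(pairs.most_common(3))   # ("AH","S"), ("S","T"), ("T","</w>") dominate
```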
“…[8] and the 300h SWB corpus [9]. We evaluate the proposed training pipeline on context-1 transducer models using phonemes for both corpora, and full-context transducer models using 5k acoustic data-driven subword modeling (ADSM) units [34] for LBS. Additionally, we reduce the LBS phoneme inventory in the official lexicon by unifying stressed phonemes, e.g.…”
Section: Learning Rate Scheduling and Epochs (mentioning)
Confidence: 99%
“…By default, we apply 1-pass LM SF decoding, where word-level LM is used for phoneme transducers. The word-level transformer (Trafo) LMs are the same as in [38] for LBS and [39] (sentence-wise) for SWB, while the ADSM Trafo LM is the same as in [34].…”
Section: Learning Rate Scheduling and Epochs (mentioning)
Confidence: 99%
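For context, shallow fusion (SF) combines the end-to-end model's score with an external LM score log-linearly during the single decoding pass. Below is a minimal sketch of the per-hypothesis scoring, assuming a tuned interpolation weight lm_scale; the function and values are illustrative assumptions, not the setup of [34], [38], or [39].

```python
import math

def sf_score(am_log_prob: float, lm_log_prob: float, lm_scale: float) -> float:
    """Shallow-fusion score used to rank hypotheses during beam search."""
    return am_log_prob + lm_scale * lm_log_prob

# Toy comparison of two partial hypotheses with lm_scale = 0.6:
hyp_a = sf_score(math.log(0.30), math.log(0.10), 0.6)   # AM prefers this one
hyp_b = sf_score(math.log(0.25), math.log(0.40), 0.6)   # LM prefers this one
print(hyp_a < hyp_b)   # True: the LM term reorders what the AM alone prefers
```

In practice the LM may operate on a different unit than the acoustic model (e.g. a word-level LM fused with a phoneme transducer, as the statement notes), which requires synchronizing LM queries with word boundaries during search.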