2022 IEEE Spoken Language Technology Workshop (SLT), 2023
DOI: 10.1109/slt54892.2023.10023323

NAM+: Towards Scalable End-to-End Contextual Biasing for Adaptive ASR

Cited by 10 publications (4 citation statements)
References 22 publications
“…While CA can recognize the right entity word (which are typically the most important words of the utterance), they sometimes unnecessarily substitute common words. This results in an increase in overall WER, which is in line with previous findings on the use of contextual biasing (Munkhdalai et al, 2023). In or struggle to boost any entity word (Wy) from the catalog.…”
Section: Results (supporting)
Confidence: 90%
“…Attention-based contextual biasing modules have widely been used by ASR systems to personalize towards a catalog of a few hundred custom entities (Pundak et al., 2018; Bruguier et al., 2019; Sathyendra et al., 2022; Dingliwal et al., 2023; Munkhdalai et al., 2022). However, Munkhdalai et al. (2023) showed that inference latency increases significantly even with a few thousand catalog items. Similar to our approach, they propose to filter a small set of entities using maximum inner product.…”
Section: Related Work (mentioning)
Confidence: 99%
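The maximum-inner-product filtering mentioned in the statement above can be illustrated with a short sketch: a pooled query embedding scores every catalog entity by inner product, and only the top-scoring entities are passed on to the more expensive attention-based biasing step. This is a minimal illustration, not the cited authors' implementation; the function name, embedding dimensions, and the PyTorch framing are assumptions.

```python
# Hypothetical sketch of maximum-inner-product pre-filtering of a biasing catalog.
import torch


def filter_entities_by_mips(query: torch.Tensor,
                            entity_embs: torch.Tensor,
                            top_k: int = 100) -> torch.Tensor:
    """Return indices of the top-k catalog entities by inner product with the query.

    query:       (d,) pooled acoustic/query embedding
    entity_embs: (N, d) embeddings of the biasing catalog
    """
    scores = entity_embs @ query            # (N,) inner products
    k = min(top_k, entity_embs.size(0))
    _, indices = torch.topk(scores, k)      # largest inner products first
    return indices


# Usage: shortlist a large catalog before the attention-based biasing step.
query = torch.randn(256)
catalog = torch.randn(5000, 256)
shortlist = catalog[filter_entities_by_mips(query, catalog, top_k=100)]
```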
“…We provide the model with real-time retrieved entities in the text prompts. We report WERs on the multi-context TTS corpora in [30], where W PREFIX and WO PREFIX evaluate the in-domain performance: each utterance is assigned a correct bias entity plus distractor entities; ANTI evaluates the out-of-domain performance: each utterance is associated with distractor entities only. The original corpora contain variants scaling from 0 to 3K bias entities assigned to each utterance.…”
Section: Speech Translation (mentioning)
Confidence: 99%
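The evaluation setup described in that statement (a correct bias entity mixed with distractors for the in-domain conditions, distractors only for the ANTI condition) can be sketched as follows. This is a hypothetical illustration of how such per-utterance biasing lists might be assembled; the function name and sampling scheme are assumptions, not the corpora's actual construction.

```python
# Illustrative construction of per-utterance biasing lists (assumed, not from the paper).
import random


def make_bias_list(gold_entity, distractor_pool, num_distractors, include_gold):
    """Sample distractor entities and optionally mix in the correct (gold) entity."""
    pool = [e for e in distractor_pool if e != gold_entity]
    entities = random.sample(pool, min(num_distractors, len(pool)))
    if include_gold:
        entities.append(gold_entity)
    random.shuffle(entities)
    return entities


pool = ["Anna Karenina", "Fort Worth", "Lake Tahoe", "Rachmaninoff"]
# In-domain condition: the correct entity hidden among distractors.
in_domain = make_bias_list("Lake Tahoe", pool, num_distractors=3, include_gold=True)
# ANTI-style condition: distractors only, so nothing in the list should be boosted.
anti = make_bias_list("Lake Tahoe", pool, num_distractors=3, include_gold=False)
```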
“…Training-time adaptation. The second category consists of approaches that modify the ASR model during training to incorporate contextual information, often relying on attention-based mechanisms (Jain et al., 2020; Chang et al., 2021; Huber et al., 2021; Sathyendra et al., 2022; Sun et al., 2023a; Munkhdalai et al., 2023; Chan et al., 2023). Such a direct integration of contextual information is usually more accurate than shallow fusion, but it comes with the added overhead of retraining the ASR model for every new dictionary to be integrated.…”
Section: Related Work and Background (mentioning)
Confidence: 99%
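As a rough picture of the attention-based, training-time biasing modules cited in that statement, the sketch below lets acoustic encoder frames cross-attend to embeddings of the biasing catalog, with a learned "no-bias" slot as a fallback, and fuses the result back into the frames through a residual connection. The module layout, dimensions, and the no-bias slot are assumptions drawn from the general literature, not the specific architecture of any cited paper.

```python
# Minimal sketch of an attention-based contextual biasing layer (assumed design).
import torch
import torch.nn as nn


class ContextualBiasingLayer(nn.Module):
    """Cross-attention from encoder frames to catalog-entity embeddings."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.no_bias = nn.Parameter(torch.zeros(1, 1, d_model))  # learned "do not bias" slot
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, frames: torch.Tensor, entities: torch.Tensor) -> torch.Tensor:
        # frames:   (B, T, d) acoustic encoder outputs
        # entities: (B, N, d) embeddings of the biasing catalog
        context = torch.cat(
            [self.no_bias.expand(frames.size(0), -1, -1), entities], dim=1)
        biased, _ = self.attn(query=frames, key=context, value=context)
        return frames + self.proj(biased)  # residual fusion of biasing context


# Usage: bias 50 encoder frames towards a catalog of 10 entities.
layer = ContextualBiasingLayer()
out = layer(torch.randn(2, 50, 256), torch.randn(2, 10, 256))
```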