2022 IEEE Spoken Language Technology Workshop (SLT), 2023
DOI: 10.1109/slt54892.2023.10023323

NAM+: Towards Scalable End-to-End Contextual Biasing for Adaptive ASR

Cited by 10 publications (4 citation statements)
References 22 publications
“…While CA can recognize the right entity word (which are typically the most important words of the utterance), they sometimes unnecessarily substitute common words. This results in an increase in overall WER, which is in line with previous findings on the use of contextual biasing (Munkhdalai et al, 2023). In or struggle to boost any entity word (Wy) from the catalog.…”
Section: Results (supporting)
Confidence: 90%
“…Attention-based contextual biasing modules have widely been used by ASR systems to personalize towards a catalog of a few hundred custom entities (Pundak et al., 2018; Bruguier et al., 2019; Sathyendra et al., 2022; Dingliwal et al., 2023; Munkhdalai et al., 2022). However, Munkhdalai et al. (2023) showed that inference latency increases significantly even with a few thousand catalog items. Similar to our approach, they propose to filter a small set of entities using maximum inner product.…”
Section: Related Work (mentioning)
Confidence: 99%
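The maximum-inner-product filtering mentioned in the statement above can be illustrated with a short sketch: a pooled query embedding scores every catalog entity by inner product, and only the top-scoring entities are passed on to the more expensive attention-based biasing step. This is a minimal illustration, not the cited authors' implementation; the function name, embedding dimensions, and the PyTorch framing are assumptions.

```python
# Hypothetical sketch of maximum-inner-product pre-filtering of a biasing catalog.
import torch


def filter_entities_by_mips(query: torch.Tensor,
                            entity_embs: torch.Tensor,
                            top_k: int = 100) -> torch.Tensor:
    """Return indices of the top-k catalog entities by inner product with the query.

    query:       (d,) pooled acoustic/query embedding
    entity_embs: (N, d) embeddings of the biasing catalog
    """
    scores = entity_embs @ query            # (N,) inner products
    k = min(top_k, entity_embs.size(0))
    _, indices = torch.topk(scores, k)      # largest inner products first
    return indices


# Usage: shortlist a large catalog before the attention-based biasing step.
query = torch.randn(256)
catalog = torch.randn(5000, 256)
shortlist = catalog[filter_entities_by_mips(query, catalog, top_k=100)]
```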
“…We provide the model with real-time retrieved entities in the text prompts. We report WERs on the multi-context TTS corpora in [30], where W PREFIX and WO PREFIX evaluate the in-domain performance: each utterance is assigned a correct bias entity plus distractor entities; ANTI evaluates the out-of-domain performance: each utterance is associated with distractor entities only. The original corpora contain variants scaling from 0 to 3K bias entities assigned to each utterance.…”
Section: Speech Translation (mentioning)
Confidence: 99%
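The evaluation setup described in that statement (a correct bias entity mixed with distractors for the in-domain conditions, distractors only for the ANTI condition) can be sketched as follows. This is a hypothetical illustration of how such per-utterance biasing lists might be assembled; the function name and sampling scheme are assumptions, not the corpora's actual construction.

```python
# Illustrative construction of per-utterance biasing lists (assumed, not from the paper).
import random


def make_bias_list(gold_entity, distractor_pool, num_distractors, include_gold):
    """Sample distractor entities and optionally mix in the correct (gold) entity."""
    pool = [e for e in distractor_pool if e != gold_entity]
    entities = random.sample(pool, min(num_distractors, len(pool)))
    if include_gold:
        entities.append(gold_entity)
    random.shuffle(entities)
    return entities


pool = ["Anna Karenina", "Fort Worth", "Lake Tahoe", "Rachmaninoff"]
# In-domain condition: the correct entity hidden among distractors.
in_domain = make_bias_list("Lake Tahoe", pool, num_distractors=3, include_gold=True)
# ANTI-style condition: distractors only, so nothing in the list should be boosted.
anti = make_bias_list("Lake Tahoe", pool, num_distractors=3, include_gold=False)
```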
“…Training-time adaptation. The second category consists of approaches that modify the ASR model during training to incorporate contextual information, often relying on attention-based mechanisms (Jain et al., 2020; Chang et al., 2021; Huber et al., 2021; Sathyendra et al., 2022; Sun et al., 2023a; Munkhdalai et al., 2023; Chan et al., 2023). Such a direct integration of contextual information is usually more accurate than shallow fusion, but it comes with the added overhead of retraining the ASR model for every new dictionary to be integrated.…”
Section: Related Work and Background (mentioning)
Confidence: 99%
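As a rough picture of the attention-based, training-time biasing modules cited in that statement, the sketch below lets acoustic encoder frames cross-attend to embeddings of the biasing catalog, with a learned "no-bias" slot as a fallback, and fuses the result back into the frames through a residual connection. The module layout, dimensions, and the no-bias slot are assumptions drawn from the general literature, not the specific architecture of any cited paper.

```python
# Minimal sketch of an attention-based contextual biasing layer (assumed design).
import torch
import torch.nn as nn


class ContextualBiasingLayer(nn.Module):
    """Cross-attention from encoder frames to catalog-entity embeddings."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.no_bias = nn.Parameter(torch.zeros(1, 1, d_model))  # learned "do not bias" slot
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, frames: torch.Tensor, entities: torch.Tensor) -> torch.Tensor:
        # frames:   (B, T, d) acoustic encoder outputs
        # entities: (B, N, d) embeddings of the biasing catalog
        context = torch.cat(
            [self.no_bias.expand(frames.size(0), -1, -1), entities], dim=1)
        biased, _ = self.attn(query=frames, key=context, value=context)
        return frames + self.proj(biased)  # residual fusion of biasing context


# Usage: bias 50 encoder frames towards a catalog of 10 entities.
layer = ContextualBiasingLayer()
out = layer(torch.randn(2, 50, 256), torch.randn(2, 10, 256))
```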