Interspeech 2019
DOI: 10.21437/interspeech.2019-1209

Shallow-Fusion End-to-End Contextual Biasing

Abstract: Contextual biasing to a specific domain, including a user's song names, app names and contact names, is an important component of any production-level automatic speech recognition (ASR) system. Contextual biasing is particularly challenging in end-to-end models because these models keep a small list of candidates during beam search, and also do poorly on proper nouns, which are the main source of biasing phrases. In this paper, we present various algorithmic and training improvements to shallow-fusion-based biasi…
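Although the abstract is truncated, the shallow-fusion recipe it builds on is simple to state: at each beam-search step, the E2E model's log-probability is interpolated with the score of an externally built contextual model, s(y|x) = log P_e2e(y|x) + λ log P_context(y). Below is a minimal sketch of that interpolation, assuming a toy prefix-matching context model; the names ContextModel, rescore, and BIAS_WEIGHT are illustrative, not from the paper.

```python
import math

# Minimal sketch of shallow-fusion contextual biasing during beam search.
# ContextModel, rescore, and BIAS_WEIGHT are illustrative names, not the
# paper's actual implementation.

BIAS_WEIGHT = 1.0  # lambda: how strongly the contextual score is weighted


class ContextModel:
    """Toy contextual model: rewards hypotheses that extend a biasing phrase."""

    def __init__(self, phrases):
        # Store every prefix of every biasing phrase so partial hypotheses
        # that are "on track" toward a phrase also receive the reward.
        self.prefixes = {p[:i] for p in phrases for i in range(1, len(p) + 1)}

    def log_score(self, text):
        # Positive reward while the hypothesis stays on a biasing phrase.
        return 1.0 if text in self.prefixes else 0.0


def rescore(hypotheses, context):
    # Shallow fusion: s(y|x) = log P_e2e(y|x) + lambda * log P_context(y)
    return sorted(
        ((text, logp + BIAS_WEIGHT * context.log_score(text))
         for text, logp in hypotheses),
        key=lambda h: -h[1],
    )


context = ContextModel(["call mom", "play frozen"])
beam = [("call tom", math.log(0.5)), ("call mom", math.log(0.4))]
print(rescore(beam, context))  # "call mom" overtakes "call tom" after biasing
```

Because the bias is applied on-the-fly to partial hypotheses, a biased candidate is less likely to fall off the small beam before it completes, which is exactly the failure mode the abstract describes.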

Cited by 111 publications (82 citation statements)
References 17 publications
“…We note that the same weights are used for both phoneme and wordpiece biasing, and empirically we did not find significant improvements by using different weights. On the other hand, for wordpiece model biasing, our results are consistent with the observation in [13] that wordpieces perform better than graphemes because of their sparsity in matching longer units.…”
Section: WERs and Comparisons (supporting)
confidence: 90%
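The "sparsity" argument can be made concrete: a grapheme-level biasing unit matches almost any word that shares a letter with a biasing phrase, while longer wordpiece units fire only on rare subsequences. A toy illustration of that difference, assuming an invented wordpiece segmentation (not the paper's actual tokenizer):

```python
# Toy comparison of grapheme vs. wordpiece biasing units for the phrase
# "frozen". The wordpiece segmentation below is invented for illustration.
GRAPHEMES = list("frozen")        # ['f', 'r', 'o', 'z', 'e', 'n']
WORDPIECES = ["_fro", "zen"]      # hypothetical subword segmentation


def spurious_matches(units, vocabulary_words):
    """Count vocabulary words that match at least one biasing unit."""
    return sum(any(u.strip("_") in w for u in units) for w in vocabulary_words)


vocab = ["for", "zone", "rose", "frozen", "dozen", "echo"]
print(spurious_matches(GRAPHEMES, vocab))   # 6: single letters match everywhere
print(spurious_matches(WORDPIECES, vocab))  # 2: only words containing 'fro'/'zen'
```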
“…All these improvements lead to significantly better biasing, which is comparable to the state-of-the-art conventional model [6]. To avoid over-biasing, [13] also proposed to activate biasing phrases only when they are preceded by a set of prefixes.…”
Section: Shallow Fusion E2E Biasing (mentioning)
confidence: 99%
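The prefix-gating idea attributed to [13] can be sketched as a simple gate on the biasing bonus: the bonus is applied only when the hypothesis so far ends in a carrier word such as "call" or "play". The prefix list, bonus value, and function name below are illustrative assumptions, not the paper's configuration:

```python
# Minimal sketch of prefix-gated biasing: the contextual bonus is applied
# only when the hypothesis so far ends with one of a set of carrier prefixes.
# ACTIVATION_PREFIXES and BIAS_BONUS are illustrative, not from the paper.
ACTIVATION_PREFIXES = ("call", "play", "open")
BIAS_BONUS = 2.0


def biasing_bonus(hypothesis_text, next_word, biasing_phrases):
    """Return the bias bonus for extending `hypothesis_text` with `next_word`."""
    last_word = hypothesis_text.split()[-1] if hypothesis_text else ""
    gate_open = last_word in ACTIVATION_PREFIXES
    on_phrase = any(next_word in phrase.split() for phrase in biasing_phrases)
    return BIAS_BONUS if gate_open and on_phrase else 0.0


print(biasing_bonus("call", "mom", ["call mom"]))   # 2.0: gated bias fires
print(biasing_bonus("i saw", "mom", ["call mom"]))  # 0.0: no carrier prefix
```

Gating the bonus this way keeps the contextual model from distorting hypotheses in contexts where the biasing phrases are unlikely, which is the over-biasing problem the snippet refers to.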
“…While this technique does not complicate AM training, it does not explicitly address the proper noun recognition problem, and the WER gain they achieved on LibriSpeech was limited. Other methods that do not rely on TTS include leveraging phonetic information to build better wordpiece inventories [19] and fuzzing the training data with phonetically similar words [16,20].…”
Section: Related Work (mentioning)
confidence: 99%
“…The model should have promising performance on user-specific information such as contacts' phone numbers and favorite song names. In [15], shallow fusion is combined with the E2E model's prediction during decoding. In [16], text-to-speech (TTS) technology is utilized to generate training samples from text-only data.…”
Section: Introduction (mentioning)
confidence: 99%