Interspeech 2019
DOI: 10.21437/interspeech.2019-1209

Shallow-Fusion End-to-End Contextual Biasing

Abstract: Contextual biasing to a specific domain, including a user's song names, app names and contact names, is an important component of any production-level automatic speech recognition (ASR) system. Contextual biasing is particularly challenging in end-to-end models because these models keep a small list of candidates during beam search, and also do poorly on proper nouns, which are the main source of biasing phrases. In this paper, we present various algorithmic and training improvements to shallow-fusion-based biasi…
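Although the abstract is truncated, the shallow-fusion recipe it builds on is simple to state: at each beam-search step, the E2E model's log-probability is interpolated with the score of an externally built contextual model, s(y|x) = log P_e2e(y|x) + λ log P_context(y). Below is a minimal sketch of that interpolation, assuming a toy prefix-matching context model; the names ContextModel, rescore, and BIAS_WEIGHT are illustrative, not from the paper.

```python
import math

# Minimal sketch of shallow-fusion contextual biasing during beam search.
# ContextModel, rescore, and BIAS_WEIGHT are illustrative names, not the
# paper's actual implementation.

BIAS_WEIGHT = 1.0  # lambda: how strongly the contextual score is weighted


class ContextModel:
    """Toy contextual model: rewards hypotheses that extend a biasing phrase."""

    def __init__(self, phrases):
        # Store every prefix of every biasing phrase so partial hypotheses
        # that are "on track" toward a phrase also receive the reward.
        self.prefixes = {p[:i] for p in phrases for i in range(1, len(p) + 1)}

    def log_score(self, text):
        # Positive reward while the hypothesis stays on a biasing phrase.
        return 1.0 if text in self.prefixes else 0.0


def rescore(hypotheses, context):
    # Shallow fusion: s(y|x) = log P_e2e(y|x) + lambda * log P_context(y)
    return sorted(
        ((text, logp + BIAS_WEIGHT * context.log_score(text))
         for text, logp in hypotheses),
        key=lambda h: -h[1],
    )


context = ContextModel(["call mom", "play frozen"])
beam = [("call tom", math.log(0.5)), ("call mom", math.log(0.4))]
print(rescore(beam, context))  # "call mom" overtakes "call tom" after biasing
```

Because the bias is applied on-the-fly to partial hypotheses, a biased candidate is less likely to fall off the small beam before it completes, which is exactly the failure mode the abstract describes.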

Cited by 111 publications (82 citation statements)
References 17 publications
“…We note that the same weights are used for both phoneme and wordpiece biasing, and empirically we did not find significant improvements by using different weights. On the other hand, for wordpiece model biasing, our results are consistent with the observation in [13] that wordpieces perform better than graphemes because of their sparsity in matching longer units.…”
Section: WERs and Comparisons (supporting)
confidence: 90%
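The "sparsity" argument can be made concrete: a grapheme-level biasing unit matches almost any word that shares a letter with a biasing phrase, while longer wordpiece units fire only on rare subsequences. A toy illustration of that difference, assuming an invented wordpiece segmentation (not the paper's actual tokenizer):

```python
# Toy comparison of grapheme vs. wordpiece biasing units for the phrase
# "frozen". The wordpiece segmentation below is invented for illustration.
GRAPHEMES = list("frozen")        # ['f', 'r', 'o', 'z', 'e', 'n']
WORDPIECES = ["_fro", "zen"]      # hypothetical subword segmentation


def spurious_matches(units, vocabulary_words):
    """Count vocabulary words that match at least one biasing unit."""
    return sum(any(u.strip("_") in w for u in units) for w in vocabulary_words)


vocab = ["for", "zone", "rose", "frozen", "dozen", "echo"]
print(spurious_matches(GRAPHEMES, vocab))   # 6: single letters match everywhere
print(spurious_matches(WORDPIECES, vocab))  # 2: only words containing 'fro'/'zen'
```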
“…All these improvements lead to significantly better biasing, which is comparable to the state-of-the-art conventional model [6]. To avoid over-biasing, [13] also proposed to activate biasing phrases only when they are preceded by a set of prefixes.…”
Section: Shallow Fusion E2E Biasing (mentioning)
confidence: 99%
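The prefix-gating idea attributed to [13] can be sketched as a simple gate on the biasing bonus: the bonus is applied only when the hypothesis so far ends in a carrier word such as "call" or "play". The prefix list, bonus value, and function name below are illustrative assumptions, not the paper's configuration:

```python
# Minimal sketch of prefix-gated biasing: the contextual bonus is applied
# only when the hypothesis so far ends with one of a set of carrier prefixes.
# ACTIVATION_PREFIXES and BIAS_BONUS are illustrative, not from the paper.
ACTIVATION_PREFIXES = ("call", "play", "open")
BIAS_BONUS = 2.0


def biasing_bonus(hypothesis_text, next_word, biasing_phrases):
    """Return the bias bonus for extending `hypothesis_text` with `next_word`."""
    last_word = hypothesis_text.split()[-1] if hypothesis_text else ""
    gate_open = last_word in ACTIVATION_PREFIXES
    on_phrase = any(next_word in phrase.split() for phrase in biasing_phrases)
    return BIAS_BONUS if gate_open and on_phrase else 0.0


print(biasing_bonus("call", "mom", ["call mom"]))   # 2.0: gated bias fires
print(biasing_bonus("i saw", "mom", ["call mom"]))  # 0.0: no carrier prefix
```

Gating the bonus this way keeps the contextual model from distorting hypotheses in contexts where the biasing phrases are unlikely, which is the over-biasing problem the snippet refers to.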
“…While this technique does not complicate AM training, it does not explicitly address the proper noun recognition problem, and the WER gain they achieved on LibriSpeech was limited. Other methods that do not rely on TTS include leveraging phonetic information to build better wordpiece inventories [19] and fuzzing the training data with phonetically similar words [16,20].…”
Section: Related Work (mentioning)
confidence: 99%
“…The model should have promising performance on user-specific information such as contacts' phone numbers and favorite song names. In [15], shallow fusion is combined with the E2E model's prediction during decoding. In [16], text-to-speech (TTS) technology is utilized to generate training samples from text-only data.…”
Section: Introduction (mentioning)
confidence: 99%