2023
DOI: 10.1109/taslp.2022.3224286

Minimising Biasing Word Errors for Contextual ASR With the Tree-Constrained Pointer Generator

Abstract: Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words. This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words obtained using external contextual information. With only a small overhead in memory use and computation cost, TCPGen can structure thousands of biasing words efficiently into a symbolic prefix tree and create a neural shortcut between the tree and…
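The symbolic prefix tree at the heart of TCPGen can be illustrated with a minimal sketch. The class names and the wordpiece tokenisation below are illustrative assumptions, not the paper's implementation; the sketch only shows how a biasing list constrains the set of valid next tokens at each decoding step.

```python
# Minimal prefix tree (trie) over wordpiece sequences, sketching how a
# biasing list constrains the valid next tokens during decoding.
# Class names and tokenisation here are illustrative assumptions.

class TrieNode:
    def __init__(self):
        self.children = {}       # token -> TrieNode
        self.is_word_end = False

def build_biasing_tree(biasing_list):
    """Insert each biasing word (given as a token sequence) into a trie."""
    root = TrieNode()
    for tokens in biasing_list:
        node = root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.is_word_end = True
    return root

def valid_next_tokens(root, prefix):
    """Return the tokens the tree allows after a decoded prefix."""
    node = root
    for tok in prefix:
        if tok not in node.children:
            return set()         # prefix left the tree: no biasing candidates
        node = node.children[tok]
    return set(node.children)

# Example: two biasing words split into wordpieces (assumed tokenisation).
tree = build_biasing_tree([["Tur", "ner"], ["Tur", "in"]])
print(sorted(valid_next_tokens(tree, ["Tur"])))  # ['in', 'ner']
```

During decoding, only the tokens returned by a lookup like this receive pointer-generator probability mass, which is what keeps the memory and computation overhead small even for thousands of biasing words.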

Cited by 6 publications (4 citation statements)
References 64 publications
“…With similar levels of R-WER reduction, AED achieved a higher reduction in WER. As analysed in [50], TCPGen produced a much more confident prediction of P_gen with AED than with N-T, where the main reductions in overall WER were attributed to the reduction in R-WER. The improvements using GNNs indicate that the GNN encoding improved the prediction of P_gen, which was more beneficial for the overall WER in AED.…”
Section: B. LibriSpeech 960-hour Results
confidence: 94%
“…where P_mdl(Y) is the probability from the end-to-end system, P_src(Y) is the source-domain LM probability, and P_tgt(Y) is the target-domain LM probability. Extending this idea to contextual biasing with TCPGen [50], BLMD can be applied as shown in Eqn. (7).…”
Section: A. Biasing-Driven LM Discounting (BLMD) for TCPGen
confidence: 99%
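The snippet's Eqn. (7) is not reproduced here, but the general density-ratio form it extends combines the three probabilities log-linearly: subtract the source-domain LM score and add the target-domain LM score. A minimal sketch, assuming illustrative interpolation weights λ and μ (the function name and weight values are not from the paper):

```python
import math

def blmd_score(log_p_mdl, log_p_src, log_p_tgt, lam=0.3, mu=0.3):
    """Density-ratio style LM discounting in log space: discount the
    source-domain LM probability and credit the target-domain LM
    probability, each with its own weight (lam, mu are assumptions)."""
    return log_p_mdl - lam * log_p_src + mu * log_p_tgt

# Toy rescoring example: the target-domain LM prefers hypothesis B,
# which can overturn the end-to-end model's original ranking.
hyp_a = blmd_score(math.log(0.6), math.log(0.5), math.log(0.1))
hyp_b = blmd_score(math.log(0.4), math.log(0.2), math.log(0.4))
```

With these toy numbers hyp_b scores higher than hyp_a even though the end-to-end model alone preferred A, which is the intended effect of discounting the source-domain LM during cross-domain decoding.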
“…The overfitting issue during LSTM training can be mitigated with the use of dropout for LSTMs. An rnnDrop approach is proposed in [24] for use in speech recognition problems.…”
Section: Methods
confidence: 99%
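The distinguishing idea of rnnDrop-style recurrent dropout is that the dropout mask is sampled once per sequence and reused at every timestep, rather than resampled per step. A minimal pure-Python sketch of that idea; the function names and the inverted-dropout scaling are assumptions, not the cited paper's exact formulation:

```python
import random

def rnn_drop_mask(hidden_size, drop_prob, seed=None):
    """Sample one dropout mask per sequence. Scaling kept units by
    1/(1 - drop_prob) is the common inverted-dropout convention
    (an assumption; the original formulation may scale differently)."""
    rng = random.Random(seed)
    keep = 1.0 - drop_prob
    return [(1.0 / keep) if rng.random() < keep else 0.0
            for _ in range(hidden_size)]

def apply_per_sequence_dropout(hidden_states, mask):
    """Apply the SAME mask to the hidden state at every timestep,
    so the set of dropped units is fixed for the whole sequence."""
    return [[h * m for h, m in zip(step, mask)] for step in hidden_states]
```

Reusing one mask across timesteps avoids repeatedly disrupting the recurrent state, which is why this variant regularises LSTMs without destroying their memory of earlier frames.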
“…Training-time adaptation. The second category consists of approaches that modify the ASR model during training to incorporate contextual information, often relying on attention-based mechanisms (Jain et al., 2020; Chang et al., 2021; Huber et al., 2021; Sathyendra et al., 2022; Sun et al., 2023a; Munkhdalai et al., 2023; Chan et al., 2023). Such direct integration of contextual information is usually more accurate than shallow fusion, but it comes with the added overhead of retraining the ASR model for every new dictionary to be integrated.…”
Section: Related Work and Background
confidence: 99%
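The attention-based mechanisms this snippet refers to generally score a decoder query against an embedding of each biasing-dictionary entry and normalise the scores with a softmax. A minimal sketch of that pattern; the plain dot-product scoring and all names are illustrative assumptions, not any cited paper's specific model:

```python
import math

def attend_to_bias_entries(query, bias_embeddings):
    """Dot-product attention over biasing-entry embeddings, returning
    normalised weights. A numerically stable softmax (subtracting the
    max score) is used for the normalisation."""
    scores = [sum(q * k for q, k in zip(query, emb))
              for emb in bias_embeddings]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: the entry aligned with the query gets the larger weight.
weights = attend_to_bias_entries([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```

Because the dictionary entries enter through learned attention rather than external score interpolation, this integration is trained end to end, which is the accuracy advantage (and the retraining cost) the snippet describes.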