Confidence Score Based Conformer Speaker Adaptation for Speech Recognition

Xie, Xurong; Wang, Tianzi; Cui, Mingyu; Xue, Boyang; Jin, Zengrui; Li, Guinan; Liu, Xunying; Meng, Helen

doi:10.21437/interspeech.2022-680

Cited by 7 publications

(1 citation statement)

References 37 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…T HE performance of automatic speech recognition (ASR) systems has been significantly improved over the past decades with the wide application of deep learning techniques [1]- [15]. More recently there has been a major trend in the speech technology field transiting from hybrid ASR systems [1]- [3] to end-to-end (E2E) modelling [4]- [8], [10], [11] which utilizes a single neural network to directly map acoustic feature vectors to the surface word or token sequences.…”

Section: Introductionmentioning

confidence: 99%

Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

Xie

Wang

Cui

et al. 2023

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

View full text Add to dashboard Cite

Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised modelbased speaker adaptation techniques to data intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compact and data efficient speaker-dependent (SD) parameter representations are used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced using a confidence score-based selection of the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity issue, which is exacerbated by data selection, is addressed by modelling the SD parameter uncertainty using Bayesian learning. Experiments on the benchmark 300-hour Switchboard and the 233-hour AMI datasets suggest that the proposed confidence score-based adaptation schemes consistently outperformed the baseline speaker-independent (SI) Conformer model and conventional non-Bayesian, point estimate-based adaptation using no speaker data selection. Similar consistent performance improvements were retained after external Transformer and LSTM language model rescoring. In particular, on the 300-hour Switchboard corpus, statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute (9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also obtained on the AMI development and evaluation sets.

show abstract

Section: Introductionmentioning

confidence: 99%

Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

Xie

Wang

Cui

et al. 2023

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

View full text Add to dashboard Cite

show abstract

Sla-former: conformer using shifted linear attention for audio-visual speech recognition

Xiao,

Huang,

Liu

et al. 2024

Complex Intell. Syst.

View full text Add to dashboard Cite

Conformer-based models have proven highly effective in Audio-visual Speech Recognition, integrating auditory and visual inputs to significantly enhance speech recognition accuracy. However, the widely utilized softmax attention mechanism within conformer models encounters scalability issues, with its spatial and temporal complexity escalating quadratically with sequence length. To address these challenges, this paper introduces the Shifted Linear Attention Conformer, an evolved iteration of the conformer architecture. Shifted Linear Attention Conformer adopts shifted linear attention as a scalable alternative to softmax attention. We conducted a thorough analysis of the factors constraining the efficiency of linear attention. To mitigate these issues, we propose the utilization of a straightforward yet potent mapping function and an efficient rank restoration module, enhancing the effectiveness of self-attention while maintaining low computational complexity. Furthermore, we integrate an advanced attention-shifting technique facilitating token manipulation within attentional mechanisms, thereby enhancing information flow across various groups. This three-part approach enhances cognitive computations, particularly beneficial for processing longer sequences. Our model achieves exceptional Word Error Rates of 1.9% and 1.5% on the Lip Reading Sentences 2 and Lip Reading Sentences 3 datasets, respectively, showcasing its state-of-the-art performance in audio-visual speech recognition tasks.

show abstract