End-To-End Speech Recognition with Joint Dereverberation of Sub-Band Autoregressive Envelopes

Kumar, Rohit; Purushothaman, Anurenjan; Sreeram, Anirudh; Ganapathy, Sriram

doi:10.1109/icassp43922.2022.9747795

Cited by 3 publications

(3 citation statements)

References 30 publications

(48 reference statements)

“…However, two issues arise with this pipelined approach: 1) the learning cost function mismatch between speech enhancement frontend and recognition back-end components is not addressed; 2) the artifacts brought by the speech enhancement front-end can lead to ASR performance degradation. To this end, a tight integration of the audio-visual speech separation, dereverberation and recognition components via joint fine-tuning [19], [23], [67], [72], [78]-[82] is considered in this paper. Three finetuning methods are investigated: a) only fine-tuning the backend ASR component using the enhanced speech outputs while the front-end remains unchanged; b) end-to-end jointly finetuning the entire system including the speech enhancement front-end and the recognition back-end components using the ASR cost function; c) end-to-end jointly fine-tuning the entire system using a multi-task criterion interpolation between the speech enhancement and recognition cost functions as follows:…”

Section: B Integration Of Speech Enhancement and Recognition (mentioning)

confidence: 99%

“…sys. 3) End-to-end joint fine-tuning of the speech enhancement front-end and recognition back-end is effective in mitigating the impact from spectral artifacts produced in SpecM based dereverberation [82] V. Their WER performance with respect to γ on the LRS2 simulated ("Simu") and replayed ("Replay") test sets are shown in Table VI.…”

Section: A Performance Of Audio-visual Multi-channel Speech Enhanceme... (mentioning)

confidence: 99%

“…An improved trade-off between the speech enhancement front-end loss function and ASR accuracy can then be obtained, for example, using multitask learning [67], [78], [79]. To date, such joint speech enhancement front-end and ASR back-end optimization has been only conducted among: a) audio-only speech enhancement and recognition systems using no video input [19], [23], [72], [78], [80]-[82]; or b) audio-visual speech separation and recognition tasks only while not considering speech dereverberation [67], [79]. Hence, there is a pressing need to derive suitable joint optimization methods for a complete audio-visual multi-channel speech separation, dereverberation and recognition system.…”

Section: Introduction (mentioning)

confidence: 99%

Audio-Visual Multi-Channel Speech Separation, Dereverberation and Recognition

Deng et al. 2022

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end jointly fine-tuning using either the ASR cost function alone, or its interpolation with the speech enhancement loss. Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
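The abstract describes fine-tuning with either the ASR cost function alone or its interpolation with the speech enhancement loss, but the exact formula is not reproduced on this page. The sketch below is therefore only an illustration of one common form, a linear interpolation with a weight `gamma`. The function name `interpolated_loss`, the placement of `gamma` on the enhancement term, and the restriction to the range [0, 1] are all assumptions, not the authors' stated criterion.

```python
def interpolated_loss(asr_loss: float, enh_loss: float, gamma: float) -> float:
    """Linearly interpolate an ASR loss and a speech enhancement loss.

    Assumption (not taken from the paper): gamma weights the enhancement
    term and (1 - gamma) weights the ASR term, so gamma = 0 reduces to
    fine-tuning with the ASR cost function alone.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma is assumed to lie in [0, 1]")
    return (1.0 - gamma) * asr_loss + gamma * enh_loss


# Hypothetical loss values, chosen only to show how the weight behaves.
asr_only = interpolated_loss(asr_loss=2.0, enh_loss=4.0, gamma=0.0)
mixed = interpolated_loss(asr_loss=2.0, enh_loss=4.0, gamma=0.25)
```

Under this assumed form, sweeping `gamma` trades enhancement quality against recognition accuracy, which is the kind of trade-off the citing statements above attribute to multi-task learning.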
