ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747795

End-To-End Speech Recognition with Joint Dereverberation of Sub-Band Autoregressive Envelopes

Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems are often required to operate in reverberant conditions, where the long-term sub-band envelopes of the speech are temporally smeared. In this paper, we develop a feature enhancement approach using a neural model operating on sub-band temporal envelopes. The temporal envelopes are modeled using the framework of frequency domain linear prediction (FDLP). The neural enhancement model proposed in this paper performs an envelope gain based enhancement …
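The paper's enhancement model itself is neural, but the underlying FDLP envelopes follow a classical recipe: apply linear prediction to the DCT of a (sub-band) signal, which models the temporal (Hilbert) envelope in the same way time-domain LPC models the spectral envelope. Below is a minimal NumPy/SciPy sketch of that standard formulation, assuming a DCT-II front end and Levinson-Durbin for the LP fit; function names and the LP order are illustrative, not the paper's exact configuration.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import freqz


def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation sequence -> LP coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err


def fdlp_envelope(x, order=20):
    """Temporal envelope via frequency-domain linear prediction (FDLP).

    LP applied to the DCT of a signal models its temporal envelope,
    the dual of time-domain LPC modelling the spectral envelope.
    """
    c = dct(x, type=2, norm="ortho")
    n = len(c)
    # Autocorrelation of the DCT coefficients up to the LP order
    r = np.array([np.dot(c[: n - k], c[k:]) for k in range(order + 1)]) / n
    a, err = levinson(r, order)
    # The AR model's power response over [0, pi] traces the temporal envelope
    _, h = freqz([np.sqrt(err)], a, worN=len(x))
    return np.abs(h) ** 2
```

For an amplitude-modulated tone, the estimated envelope follows the modulation: a Hann-windowed cosine yields an FDLP envelope peaking near the signal's center.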

Cited by 3 publications (3 citation statements). References 30 publications (48 reference statements).
“…However, two issues arise with this pipelined approach: 1) the learning cost function mismatch between the speech enhancement front-end and recognition back-end components is not addressed; 2) the artifacts introduced by the speech enhancement front-end can lead to ASR performance degradation. To this end, a tight integration of the audio-visual speech separation, dereverberation and recognition components via joint fine-tuning [19], [23], [67], [72], [78]-[82] is considered in this paper. Three fine-tuning methods are investigated: a) fine-tuning only the back-end ASR component using the enhanced speech outputs while the front-end remains unchanged; b) end-to-end jointly fine-tuning the entire system, including the speech enhancement front-end and the recognition back-end components, using the ASR cost function; c) end-to-end jointly fine-tuning the entire system using a multi-task criterion interpolating between the speech enhancement and recognition cost functions as follows:…”
Section: B. Integration of Speech Enhancement and Recognition
confidence: 99%
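The interpolation criterion itself is truncated in the excerpt above ("…as follows:…"). A common form for such a multi-task criterion, sketched here purely as an assumption and not as the citing paper's exact equation, is a convex combination of the two objectives weighted by a scalar γ (consistent with the γ sweep mentioned in the later citation statement):

```python
def multitask_loss(loss_asr, loss_enh, gamma):
    """Hypothetical multi-task criterion: interpolate the ASR and speech
    enhancement objectives with a trade-off weight gamma in [0, 1]."""
    return (1.0 - gamma) * loss_asr + gamma * loss_enh
```

Setting gamma to 0 recovers pure ASR-loss fine-tuning (method b above), while intermediate values trade recognition accuracy against enhancement quality during joint training.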
“…sys. 3) End-to-end joint fine-tuning of the speech enhancement front-end and recognition back-end is effective in mitigating the impact of spectral artifacts produced in SpecM based dereverberation [82]. Their WER performance with respect to γ on the LRS2 simulated ("Simu") and replayed ("Replay") test sets is shown in Table VI.…”
Section: A. Performance of Audio-Visual Multi-Channel Speech Enhanceme…
confidence: 99%