Investigation of a Single-Channel Frequency-Domain Speech Enhancement Network to Improve End-to-End Bengali Automatic Speech Recognition Under Unseen Noisy Conditions

Noor, Md Mahbub E; Lu, Yen-Ju; Wang, Syu-Siang; Ghose, Supratip; Chang, Chia‐Yu; Zezario, Ryandhimas E.; Ahmed, Shafique; Chung, Wei-Ho; Tsao, Yu; Wang, Hsin‐Min

doi:10.1109/o-cocosda202152914.2021.9660563

Cited by 2 publications

(4 citation statements)

References 25 publications


“…This section evaluated the performance of the conformer-transducer ASR model applied using the proposed two-step joint optimization approach and compared it with the performance using multi-condition training [27] and the conventional joint optimization approaches [37, 38]. In addition, an ablation study was performed to examine the effectiveness of the proposed joint optimization approach according to each processing block of the conformer-transducer ASR model.…”

Section: Methods (mentioning)

confidence: 99%

“…The performance of each ASR model obtained by various optimization approaches was evaluated by measuring the character error rate (CER) and word error rate (WER). The ASR models compared here were (1) ASR-only trained using the clean training dataset; (2) ASR-only trained using the noisy training dataset; (3) a combination of the speech enhancement (SE) and ASR models (denoted as SE-ASR) after each of the two models was separately trained using the noisy training dataset; (4) a combined model of the SE and ASR models (denoted as SE+ASR) trained by a conventional joint optimization as in [37]; (5) SE+ASR trained using a conventional two-step joint optimization as in [38]; and (6) SE+ASR trained using the proposed two-step joint optimization. Note that all the combined models from (3) to (6) were trained using the noisy training dataset.…”

Section: Methods (mentioning)

confidence: 99%
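
The two metrics named in the statement above can be illustrated with a minimal sketch. It assumes the usual definition (Levenshtein edit distance between reference and hypothesis, divided by the reference length) and is not taken from the cited papers. The function names `edit_distance`, `wer`, and `cer` are hypothetical, and counting spaces as characters in CER is an assumption, since conventions differ between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.
    Assumption: spaces are counted as characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. Both rates can exceed 1.0 when the hypothesis contains many insertions.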

“…The use of a joint optimization framework between a speech enhancement and an ASR model has also been investigated [34, 35, 36, 37]. Specifically, a speech enhancement model and an ASR model can be used as a front-end and a back-end module, respectively, to construct a pipeline for joint optimization.…”

Section: Introduction (mentioning)

confidence: 99%

“…Additionally, a pipeline for joint optimization composed of a bi-directional long-short term memory (BiLSTM)-based speech enhancement and a conformer-based ASR was proposed [35, 36]. In [37], joint optimization was performed by constructing a pipeline using the model parameters of the front-end and the back-end model, where the front-end and back-end models were already trained individually with their own loss functions.…”

Section: Introduction (mentioning)

confidence: 99%


Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition

Lee

Kim²

2022

Sensors


In this paper, a new two-step joint optimization approach based on the asynchronous subregion optimization method is proposed for training a pipeline model composed of two different models. The first-step processing of the proposed joint optimization approach trains the front-end model only, and the second-step processing trains all the parameters of the combined model together. In the asynchronous subregion optimization method, the first-step processing only supports the goal of the front-end model. However, the first-step processing of the proposed approach works with a new loss function to make the front-end model support the goal of the back-end model. The proposed optimization approach was applied here to a pipeline composed of a deep complex convolutional recurrent network (DCCRN)-based speech enhancement model and a conformer-transducer-based ASR model as a front-end and a back-end, respectively. Then, the performance of the proposed two-step joint optimization approach was evaluated on the LibriSpeech automatic speech recognition (ASR) corpus in noisy environments by measuring the character error rate (CER) and word error rate (WER). In addition, an ablation study was carried out to examine the effectiveness of the proposed optimization approach on each of the processing blocks in the conformer-transducer ASR model. The ablation study showed that the conformer-transducer-based ASR model with the joint network trained only by the proposed optimization approach achieved the lowest average CER and WER. Moreover, the proposed optimization approach reduced the average CER and WER on the Test-Noisy dataset under matched noise conditions by 0.30% and 0.48%, respectively, compared to the approach of separate optimization of speech enhancement and ASR. Compared to the conventional two-step joint optimization approach, the proposed optimization approach provided average CER and WER reductions of 0.22% and 0.31%, respectively. Moreover, it was revealed that the proposed optimization approach achieved a lower average CER and WER, by 0.32% and 0.43%, respectively, than the conventional optimization approach under mismatched noise conditions.

