ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747180
MetricGAN-U: Unsupervised Speech Enhancement/Dereverberation Based Only on Noisy/Reverberated Speech

Abstract: Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE). The training of the VQ-VAE relies on clean speech; hence, large quantization errors can be expe…
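As a rough illustration of the scoring idea sketched in the abstract (a hedged Python sketch, not the authors' implementation; the encoder outputs and codebook below stand in for a VQ-VAE trained on clean speech only), the quality estimate is derived from how far each encoded frame lies from its nearest codeword:

# Illustrative sketch (not the paper's code): scoring speech by VQ-VAE
# quantization error. `frames_z` are hypothetical encoder outputs and
# `codebook` the learned codewords of a model trained on clean speech only.
import torch

def vq_quantization_error(frames_z: torch.Tensor, codebook: torch.Tensor) -> float:
    """frames_z: (T, D) frame embeddings; codebook: (K, D) codewords.
    Returns the mean squared distance of each frame to its nearest codeword."""
    dists = torch.cdist(frames_z, codebook) ** 2   # (T, K) squared distances
    nearest = dists.min(dim=1).values              # (T,) nearest-codeword error
    return nearest.mean().item()

# Lower error: the input lies close to the clean-speech codebook (higher
# estimated quality); higher error: the input is likely degraded.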

Cited by 22 publications (10 citation statements)
References 62 publications
“…To verify the generalization of our network across diverse datasets, we employed standard evaluation metrics, including the Pearson correlation coefficient (r) and root mean square error (RMSE), to quantify the disparities between predicted values and actual values. The calculation formulas, Equations (18)-(20), are shown as follows:…”
Section: Quantitative Results (mentioning; confidence: 99%)
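The cited equations (18)-(20) are not reproduced in the excerpt; for reference, the standard definitions of the Pearson correlation coefficient between predicted scores and ground-truth scores, and of the RMSE, are shown below in their usual textbook form (not necessarily the citing paper's exact notation):

\[ r = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2} \]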
“…Speech quality assessment algorithms such as MOSNet, AutoMOS, and NISQA primarily focus on noise, with the models using mel-frequency cepstrum coefficient features as the vector for extracting speech quality. The NOMAM model and the speech evaluation algorithm proposed by Fu et al. utilize self-supervised learning features for assessing speech quality, but the self-supervised vector training mentioned still focuses on extracting speech noise, using noise characterization to predict speech quality [19,20]. Therefore, in the design of ARCnet, not only were mel-frequency cepstrum coefficient features strongly correlated with noise used, but self-supervised vector representations for comprehensibility features relevant to downstream tasks like speech recognition were also considered.…”
Section: Introduction (mentioning; confidence: 99%)
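For concreteness, a minimal Python sketch of the two feature families contrasted in this statement: noise-oriented mel-frequency cepstral coefficients and self-supervised wav2vec 2.0 representations. The function names and the choice of torchaudio's WAV2VEC2_BASE bundle are illustrative assumptions, not a description of ARCnet's actual front end:

# Hedged illustration only: MFCC features (noise-oriented) vs. self-supervised
# features (closer to intelligibility/downstream recognition), as contrasted above.
import librosa
import torch
import torchaudio

def mfcc_features(path: str, n_mfcc: int = 13):
    y, sr = librosa.load(path, sr=16000)
    # (n_mfcc, frames): spectral-envelope features commonly used in MOSNet-style models
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def ssl_features(path: str):
    bundle = torchaudio.pipelines.WAV2VEC2_BASE   # assumed SSL model for illustration
    model = bundle.get_model().eval()
    waveform, sr = torchaudio.load(path)
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.no_grad():
        # List of per-layer outputs, each (batch, frames, dim); later layers tend to
        # encode phonetic content relevant to intelligibility-oriented quality prediction.
        layers, _ = model.extract_features(waveform)
    return layers[-1]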
“…In the literature, there have been several studies incorporating speech assessment models to improve SE performance [57]-[60], such as MetricGAN [57] and MetricGAN+ [58]. In addition, some SE methods prepare multiple SE systems and use speech assessment models to select the SE system that is most suitable for the test utterance, such as SSEMS [61] and ZMOS [62].…”
Section: Introduction (mentioning; confidence: 99%)
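As a rough illustration of the metric-guided idea behind the MetricGAN family (a schematic sketch under assumed interfaces, not the published implementation): a learned assessment network D predicts a quality score for the enhanced output, and the enhancer G is updated so that this predicted score approaches its maximum; D itself is alternately trained to regress the true metric so its gradients remain a useful surrogate.

# Schematic generator update in the spirit of MetricGAN (illustrative only).
# G and D are assumed torch.nn.Module instances: G predicts a T-F mask,
# D maps an enhanced magnitude spectrogram to a normalized quality score.
import torch

def generator_step(G, D, noisy_spec, optimizer, target_score: float = 1.0) -> float:
    optimizer.zero_grad()
    mask = G(noisy_spec)                       # predicted time-frequency mask
    enhanced = mask * noisy_spec               # masked (enhanced) spectrogram
    predicted = D(enhanced)                    # surrogate metric score in [0, 1]
    loss = ((predicted - target_score) ** 2).mean()
    loss.backward()                            # gradients flow through D into G
    optimizer.step()
    return loss.item()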
“…The robustness of this filter can be further improved by injecting noise information [16], temporal dependencies [20]-[22], and information from other modalities, such as vision [17], [23]. Besides, speech enhancement approaches based on perceptual metric-guided adversarial training [24], [25] and diffusion-based generative models [26], [27] have also been presented. In contrast, supervised masking approaches [18] aim to learn the mapping from the noisy input to a masking filter.…”
Section: Introduction (mentioning; confidence: 99%)
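As a minimal sketch of the masking formulation described above (mask_net is a hypothetical placeholder network, not any of the cited systems): the model predicts a bounded time-frequency mask, applies it to the noisy spectrogram, and resynthesizes the waveform with the noisy phase.

# Illustrative masking-based enhancement pipeline (assumed shapes and network).
import torch

def enhance(noisy_wave: torch.Tensor, mask_net, n_fft: int = 512, hop: int = 128):
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wave, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    mag = spec.abs()
    mask = torch.sigmoid(mask_net(mag))        # bounded mask in (0, 1)
    enhanced_spec = mask * spec                # scale magnitudes, keep noisy phase
    return torch.istft(enhanced_spec, n_fft, hop_length=hop, window=window)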