ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747180
MetricGAN-U: Unsupervised Speech Enhancement/Dereverberation Based Only on Noisy/Reverberated Speech

Abstract: Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE). The training of the VQ-VAE relies on clean speech; hence, large quantization errors can be expe…
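As a rough illustration of the scoring idea sketched in the abstract (a hedged Python sketch, not the authors' implementation; the encoder outputs and codebook below stand in for a VQ-VAE trained on clean speech only), the quality estimate is derived from how far each encoded frame lies from its nearest codeword:

# Illustrative sketch (not the paper's code): scoring speech by VQ-VAE
# quantization error. `frames_z` are hypothetical encoder outputs and
# `codebook` the learned codewords of a model trained on clean speech only.
import torch

def vq_quantization_error(frames_z: torch.Tensor, codebook: torch.Tensor) -> float:
    """frames_z: (T, D) frame embeddings; codebook: (K, D) codewords.
    Returns the mean squared distance of each frame to its nearest codeword."""
    dists = torch.cdist(frames_z, codebook) ** 2   # (T, K) squared distances
    nearest = dists.min(dim=1).values              # (T,) nearest-codeword error
    return nearest.mean().item()

# Lower error: the input lies close to the clean-speech codebook (higher
# estimated quality); higher error: the input is likely degraded.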

Cited by 22 publications (10 citation statements)
References 62 publications
“…To verify the generalization of our network across diverse datasets, we employed standard evaluation metrics, including the Pearson correlation coefficient (r) and root mean square error (RMSE), to quantify the disparities between predicted values and actual values. The calculation formulas, Equations (18)-(20), are shown as follows:…”
Section: Quantitative Results (mentioning; confidence: 99%)
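The cited equations (18)-(20) are not reproduced in the excerpt; for reference, the standard definitions of the Pearson correlation coefficient between predicted scores and ground-truth scores, and of the RMSE, are shown below in their usual textbook form (not necessarily the citing paper's exact notation):

\[ r = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2} \]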
“…Speech quality assessment algorithms such as MOSNet, AutoMOS, and NISQA primarily focus on noise, with the models using mel-frequency cepstrum coefficient features as the vector for extracting speech quality. The NOMAM model and the speech evaluation algorithm proposed by Fu et al. utilize self-supervised learning features for assessing speech quality, but the self-supervised vector training mentioned still focuses on extracting speech noise, using noise characterization to predict speech quality [19,20]. Therefore, in the design of ARCnet, not only were mel-frequency cepstrum coefficient features strongly correlated with noise used, but self-supervised vector representations for comprehensibility features relevant to downstream tasks like speech recognition were also considered.…”
Section: Introduction (mentioning; confidence: 99%)
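For concreteness, a minimal Python sketch of the two feature families contrasted in this statement: noise-oriented mel-frequency cepstral coefficients and self-supervised wav2vec 2.0 representations. The function names and the choice of torchaudio's WAV2VEC2_BASE bundle are illustrative assumptions, not a description of ARCnet's actual front end:

# Hedged illustration only: MFCC features (noise-oriented) vs. self-supervised
# features (closer to intelligibility/downstream recognition), as contrasted above.
import librosa
import torch
import torchaudio

def mfcc_features(path: str, n_mfcc: int = 13):
    y, sr = librosa.load(path, sr=16000)
    # (n_mfcc, frames): spectral-envelope features commonly used in MOSNet-style models
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def ssl_features(path: str):
    bundle = torchaudio.pipelines.WAV2VEC2_BASE   # assumed SSL model for illustration
    model = bundle.get_model().eval()
    waveform, sr = torchaudio.load(path)
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.no_grad():
        # List of per-layer outputs, each (batch, frames, dim); later layers tend to
        # encode phonetic content relevant to intelligibility-oriented quality prediction.
        layers, _ = model.extract_features(waveform)
    return layers[-1]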
“…In the literature, there have been several studies incorporating speech assessment models to improve SE performance [57]-[60], such as MetricGAN [57] and MetricGAN+ [58]. In addition, some SE methods prepare multiple SE systems and use speech assessment models to select the SE system that is most suitable for the test utterance, such as SSEMS [61] and ZMOS [62].…”
Section: Introduction (mentioning; confidence: 99%)
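As a rough illustration of the metric-guided idea behind the MetricGAN family (a schematic sketch under assumed interfaces, not the published implementation): a learned assessment network D predicts a quality score for the enhanced output, and the enhancer G is updated so that this predicted score approaches its maximum; D itself is alternately trained to regress the true metric so its gradients remain a useful surrogate.

# Schematic generator update in the spirit of MetricGAN (illustrative only).
# G and D are assumed torch.nn.Module instances: G predicts a T-F mask,
# D maps an enhanced magnitude spectrogram to a normalized quality score.
import torch

def generator_step(G, D, noisy_spec, optimizer, target_score: float = 1.0) -> float:
    optimizer.zero_grad()
    mask = G(noisy_spec)                       # predicted time-frequency mask
    enhanced = mask * noisy_spec               # masked (enhanced) spectrogram
    predicted = D(enhanced)                    # surrogate metric score in [0, 1]
    loss = ((predicted - target_score) ** 2).mean()
    loss.backward()                            # gradients flow through D into G
    optimizer.step()
    return loss.item()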
“…The robustness of this filter can be further improved by injecting noise information [16], temporal dependencies [20]-[22], and information from other modalities, such as vision [17], [23]. Besides, speech enhancement approaches based on perceptual metric-guided adversarial training [24], [25] and diffusion-based generative models [26], [27] have also been presented. In contrast, supervised masking approaches [18] aim to learn the mapping from the noisy input to a masking filter.…”
Section: Introduction (mentioning; confidence: 99%)
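As a minimal sketch of the masking formulation described above (mask_net is a hypothetical placeholder network, not any of the cited systems): the model predicts a bounded time-frequency mask, applies it to the noisy spectrogram, and resynthesizes the waveform with the noisy phase.

# Illustrative masking-based enhancement pipeline (assumed shapes and network).
import torch

def enhance(noisy_wave: torch.Tensor, mask_net, n_fft: int = 512, hop: int = 128):
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wave, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    mag = spec.abs()
    mask = torch.sigmoid(mask_net(mag))        # bounded mask in (0, 1)
    enhanced_spec = mask * spec                # scale magnitudes, keep noisy phase
    return torch.istft(enhanced_spec, n_fft, hop_length=hop, window=window)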