ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413711
CDPAM: Contrastive Learning for Perceptual Audio Similarity

Abstract: Many speech processing methods based on deep learning require an automatic and differentiable audio metric for the loss function. The DPAM approach of Manocha et al. [1] learns a full-reference metric trained directly on human judgments, and thus correlates well with human perception. However, it requires a large number of human annotations and does not generalize well outside the range of perturbations on which it was trained. This paper introduces CDPAM, a metric that builds on and advances DPAM. The primary …
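The abstract describes a full-reference, differentiable perceptual metric intended for use as a training loss. As a rough illustration only, the sketch below shows how such a metric could be dropped into a PyTorch training loop for an enhancement model; the PerceptualMetric class, its forward(reference, estimate) signature, and the enhancer model are hypothetical placeholders, not the authors' released API.

```python
# Hedged sketch: using a learned full-reference perceptual metric as a loss.
# `PerceptualMetric` and `enhancer` are hypothetical placeholders, not the
# CDPAM release; they only illustrate the training pattern from the abstract.
import torch
import torch.nn as nn

class PerceptualMetric(nn.Module):
    """Stand-in for a learned full-reference metric (a frozen DPAM/CDPAM-style encoder is assumed)."""
    def __init__(self, embed: nn.Module):
        super().__init__()
        self.embed = embed                     # pretrained, frozen audio encoder (assumed)
        for p in self.embed.parameters():
            p.requires_grad_(False)

    def forward(self, reference: torch.Tensor, estimate: torch.Tensor) -> torch.Tensor:
        # Distance between embeddings serves as the perceptual distance.
        return (self.embed(reference) - self.embed(estimate)).pow(2).mean()

def training_step(enhancer: nn.Module,
                  metric: PerceptualMetric,
                  noisy: torch.Tensor,
                  clean: torch.Tensor,
                  optimizer: torch.optim.Optimizer) -> float:
    optimizer.zero_grad()
    estimate = enhancer(noisy)
    loss = metric(clean, estimate)             # differentiable, so gradients reach the enhancer
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the metric is differentiable end to end, it can replace or complement waveform and spectrogram losses during training, which is the use case the abstract motivates.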

Cited by 26 publications (14 citation statements); references 31 publications.
“…On the other hand, one may consider adapting existing objective assessment metrics for quality of monaural signals such as PESQ [6], POLQA [7], DPAM [8] and CDPAM [9] for this task. However, since these metrics only focus on perceived quality rather than spatialization, their utility for multi-channel signals remains limited [1,10].…”
Section: Introduction
confidence: 99%
“…We also compare the subjective listening test scores with widely used audio quality metrics and suggest that, similar to speech enhancement, these metrics correlate poorly with human perception [1,5]. With this work, we hope to motivate both future research in music enhancement as well as music quality perceptual metrics akin to those in the speech literature [6,7]. To promote further research, audio samples generated in our experiments and source code are provided at our project website.…”
Section: Introduction
confidence: 95%
“…Contrastive learning is a self-supervised machine-learning method that can utilize unlabeled data by learning from intrinsic similarity relations between data. Contrastive learning is widely used in speech quality assessment, in which speech representations are learned from large-scale unlabeled speech data [9][10][11]. Given scores s1 and s2 of utterances x1 and x2, the difference in the scores (d_{x1,x2} = s1 − s2) can be regarded as the difference in the two utterances in terms of speech quality.…”
Section: Contrastive Loss
confidence: 99%
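The statement above frames pairwise quality-score differences as a training signal. A minimal sketch of that idea follows, assuming an encoder maps each utterance to an embedding, a scoring head produces a scalar quality estimate, and the predicted difference is regressed onto d_{x1,x2} = s1 − s2; the encoder, the score head, and the mean-squared-error objective are illustrative assumptions, not the cited papers' exact formulation.

```python
# Minimal sketch: pairwise score-difference objective for speech quality
# assessment. The encoder and scoring head are illustrative placeholders;
# the cited papers' exact architectures and losses may differ.
import torch
import torch.nn as nn

class QualityModel(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int):
        super().__init__()
        self.encoder = encoder                      # e.g. a self-supervised speech encoder (assumed)
        self.score_head = nn.Linear(embed_dim, 1)   # maps an embedding to a scalar quality score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Predicted quality score s for an utterance x.
        return self.score_head(self.encoder(x)).squeeze(-1)

def pairwise_difference_loss(model: QualityModel,
                             x1: torch.Tensor, x2: torch.Tensor,
                             s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
    # Regress the predicted score difference onto the labeled difference
    # d_{x1,x2} = s1 - s2 for each utterance pair in the batch.
    d_pred = model(x1) - model(x2)
    d_true = s1 - s2
    return nn.functional.mse_loss(d_pred, d_true)
```

Training on score differences rather than absolute scores lets the model exploit relative judgments between utterance pairs, which is the property the quoted statement highlights.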