Assessment of many audio processing tasks relies on subjective evaluation which is time-consuming and expensive. Efforts have been made to create objective metrics but existing ones correlate poorly with human judgment. In this work, we construct a differentiable metric by fitting a deep neural network on a newly collected dataset of just-noticeable differences (JND), in which humans annotate whether a pair of audio clips are identical or not. By varying the type of differences, including noise, reverb, and compression artifacts, we are able to learn a metric that is well-calibrated with human judgments. Furthermore, we evaluate this metric by training a neural network, using the metric as a loss function. We find that simply replacing an existing loss with our metric yields significant improvement in denoising as measured by subjective pairwise comparison.
We propose a method to both identify and synchronize multicamera video recordings within a large collection of video and/or audio files. Landmark-based audio fingerprinting is used to match multiple recordings of the same event together and time-synchronize each file within the groups. Compared to prior work, we offer improvements towards event identification and a new synchronization refinement method that resolves inconsistent estimates and allows non-overlapping content to be synchronized within larger groups of recordings. Furthermore, the audio fingerprinting-based synchronization is shown to be equivalent to an efficient and scalable time-difference-of-arrival method using cross-correlation performed on a non-linearly transformed signal.
Music similarity search is useful for a variety of creative tasks such as replacing one music recording with another recording with a similar "feel", a common task in video editing. For this task, it is typically necessary to define a similarity metric to compare one recording to another. Music similarity, however, is hard to define and depends on multiple simultaneous notions of similarity (i.e. genre, mood, instrument, tempo). While prior work ignore this issue, we embrace this idea and introduce the concept of multidimensional similarity and unify both global and specialized similarity metrics into a single, semantically disentangled multidimensional similarity metric. To do so, we adapt a variant of deep metric learning called conditional similarity networks to the audio domain and extend it using track-based information to control the specificity of our model. We evaluate our method and show that our single, multidimensional model outperforms both specialized similarity spaces and alternative baselines. We also run a user-study and show that our approach is favored by human annotators as well.
Fig. 1: Given a natural sound in a real-world room that is recorded using a cellphone microphone (left), we estimate the acoustic material properties and the frequency equalization of the room using a novel deep learning approach (middle). We use the estimated acoustic material properties for generating plausible sound effects in the virtual model of the room (right). Our approach is general and robust, and works well with commodity devices.Abstract-We present a new method to capture the acoustic characteristics of real-world rooms using commodity devices, and use the captured characteristics to generate similar sounding sources with virtual models. Given the captured audio and an approximate geometric model of a real-world room, we present a novel learning-based method to estimate its acoustic material properties. Our approach is based on deep neural networks that estimate the reverberation time and equalization of the room from recorded audio. These estimates are used to compute material properties related to room reverberation using a novel material optimization objective. We use the estimated acoustic material characteristics for audio rendering using interactive geometric sound propagation and highlight the performance on many real-world scenarios. We also perform a user study to evaluate the perceptual similarity between the recorded sounds and our rendered audio.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.