Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms

Kilgour, Kevin; Zuluaga, Mauricio; Roblek, Dominik; Sharifi, Matthew

doi:10.21437/interspeech.2019-2219

Cited by 68 publications

(45 citation statements)

References 15 publications

“…For each model, the network is trained for about 120 epochs and the weights are saved each 8 epochs. We generated drum sounds with the regular weights and with the EMA weights and we observed the same phenomenon as in Song and Ermon [2020]: for the regular weights the quality of the sounds is not necessarily increasing with the training time whereas the EMA weights provide better and more homogeneous Fréchet Audio Distance Kilgour et al. [2019] (FAD) during training.…”

Section: Models and Process (supporting)

confidence: 64%

CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis

Rouard, Hadjeres

2021

Preprint

In this paper, we propose a novel score-based generative model for unconditional raw audio synthesis. Our proposal builds upon the latest developments on diffusion process modeling with stochastic differential equations, which already demonstrated promising results on image generation. We motivate novel heuristics for the choice of the diffusion processes better suited for audio generation, and consider the use of a conditional U-Net to approximate the score function. While previous approaches on diffusion models on audio were mainly designed as speech vocoders in medium resolution, our method termed CRASH (Controllable Raw Audio Synthesis with High-resolution) allows us to generate short percussive sounds at 44.1 kHz in a controllable way. Through extensive experiments, we showcase on a drum sound generation task the numerous sampling schemes offered by our method (unconditional generation, deterministic generation, inpainting, interpolation, variations, class-conditional sampling) and propose the class-mixing sampling, a novel way to generate "hybrid" sounds. Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities with lighter and easier-to-train models.
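
The citation statement above contrasts regular weights with EMA (exponential moving average) weights. As a rough illustration only, the update behind EMA weights can be sketched as below. The function name `ema_update`, the flat list representation of the weights, and the decay value are assumptions made for this example; the statement does not give the decay or the implementation used in CRASH.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Return the updated moving average: ema <- decay * ema + (1 - decay) * w.

    Both arguments are flat lists of floats (a stand-in for model parameters).
    The default decay is an illustrative assumption, not a value from the paper.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy usage: two "training steps" that both end at the same raw weights.
ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.5)
```

The averaged copy changes more slowly than the raw weights, which is consistent with the more homogeneous FAD during training that the statement reports.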


“…We perform SSIM in the frequency domain to compare the synthetic spectrogram with the real-world sample. • Fréchet Audio Distance (FAD) [22] measures the quality and diversity of the generated samples. FAD score is the distance between two multivariate Gaussian estimated on sets of embeddings, i.e.…”

Section: Discussion (mentioning)

confidence: 99%

“…We use the Mean Opinion Score (MOS) test as a subjective evaluation. To evaluate each of the different vocoders objectively, we used the following four different evaluation metrics: Structural Similarity Index Measure (SSIM) [21], Fréchet Audio Distance (FAD) [22], Log-mel Spectrogram Mean Squared Error (LS-MSE), and Peak Signal-to-Noise Ratio (PSNR). More details about the experiment setup and evaluation metrics are presented in § 3.…”

Section: Introduction (mentioning)

confidence: 99%

VocBench: A Neural Vocoder Benchmark for Speech Synthesis

AlBadawy, Gibiansky, He, et al.

2021

Preprint

Neural vocoders, used for converting the spectral representations of an audio signal to waveforms, are a commonly used component in speech synthesis pipelines. They focus on synthesizing waveforms from low-dimensional representations, such as Mel-Spectrograms. In recent years, different approaches have been introduced to develop such vocoders. However, it becomes more challenging to assess these new vocoders and compare their performance to previous ones. To address this problem, we present VocBench, a framework that benchmarks the performance of state-of-the-art neural vocoders. VocBench uses a systematic study to evaluate different neural vocoders in a shared environment that enables a fair comparison between them. In our experiments, we use the same setup for datasets, training pipeline, and evaluation metrics for all neural vocoders. We perform a subjective and objective evaluation to compare the performance of each vocoder along a different axis. Our results demonstrate that the framework is capable of showing the competitive efficacy and the quality of the synthesized samples for each vocoder. The VocBench framework is available at https://github.com/facebookresearch/vocoder-benchmark.
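
The VocBench citation statement above describes FAD as the distance between two multivariate Gaussians estimated on sets of embeddings. A minimal sketch of that computation follows, assuming the embeddings have already been extracted and are given as arrays of shape `(n_samples, dim)`. The function name `frechet_distance` is hypothetical, and the embedding model itself is outside the scope of the example. The distance used is the standard Fréchet distance between Gaussians, ||mu_a - mu_b||^2 + tr(S_a + S_b - 2 (S_a S_b)^(1/2)).

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_samples, dim), one embedding per row.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))

    # tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Toy usage with random stand-ins for real audio embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
generated = reference + np.array([1.0, 0.0, 0.0, 0.0])  # same spread, shifted mean
score = frechet_distance(reference, generated)
```

Lower values mean the two embedding distributions are closer. In the toy usage only the mean differs, so the score reduces to the squared length of the shift.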

“…Several studies indicate that widely-adopted source separation metrics such as signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR) [56] do not always agree with human perception [7], [9], [35], [57]. Moreover, as brought out in [35], an increment of noise or interferences in the separated source produces an increment of the SAR value.…”

Section: Metrics (mentioning)

confidence: 99%

Conditioned Source Separation for Musical Instrument Performances

Slizovskaia, Haro

2021

IEEE/ACM Trans. Audio Speech Lang. Process.

In music source separation, the number of sources may vary for each piece and some of the sources may belong to the same family of instruments, thus sharing timbral characteristics and making the sources more correlated. This leads to additional challenges in the source separation problem. This paper proposes a source separation method for multiple musical instruments sounding simultaneously and explores how much additional information apart from the audio stream can lift the quality of source separation. We explore conditioning techniques at different levels of a primary source separation network and utilize two extra modalities of data, namely presence or absence of instruments in the mixture, and the corresponding video stream data.
