The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

Petermann, Darius; Wichern, Gordon; Wang, Zhongqiu; Roux, Jonathan Le

doi:10.1109/icassp43922.2022.9746005

Cited by 9 publications

(18 citation statements)

References 34 publications

“…The overall operation of this module is represented by the transformation $\mathcal{R} : \mathbb{R}^{D \times B \times T} \to \mathbb{R}^{D \times B \times T}$ to obtain the output $\Lambda = \mathcal{R}(V) \in \mathbb{R}^{D \times B \times T}$. TF modeling with 8 residual GRU pairs accounts for 10.5 M trainable parameters³.…”

Section: Time Frequency Modeling (mentioning)

confidence: 99%
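
As a rough sanity check on the 10.5 M figure quoted above, the arithmetic can be sketched as follows. The sizes used here (feature dimension D = 128, GRU hidden size 256, bidirectional GRUs, a linear projection back to D, and a per-channel normalization) are assumptions chosen for illustration. None of them is stated in the excerpt, and the actual module layout may differ.

```python
def gru_params(input_size, hidden_size, bidirectional=True):
    """Parameter count of one GRU layer: three gates, each with input weights,
    recurrent weights, and two bias vectors (the convention used by torch.nn.GRU)."""
    per_direction = 3 * (
        hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    )
    return per_direction * (2 if bidirectional else 1)


def linear_params(in_features, out_features):
    """Weights plus bias of a fully connected layer."""
    return in_features * out_features + out_features


# Assumed sizes (hypothetical, not given in the excerpt).
D = 128        # feature dimension of the D x B x T tensor
HIDDEN = 256   # GRU hidden size per direction
N_PAIRS = 8    # "8 residual GRU pairs" from the quoted statement

per_module = (
    2 * D                                     # normalization scale and shift
    + gru_params(D, HIDDEN, bidirectional=True)
    + linear_params(2 * HIDDEN, D)            # project 2*HIDDEN back to D
)
total = 2 * N_PAIRS * per_module              # each pair holds two modules

print(f"{total:,} parameters, about {total / 1e6:.1f} M")
# prints "10,541,056 parameters, about 10.5 M"
```

Under these assumed sizes the count lands close to the quoted figure, but that agreement alone does not confirm the configuration.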

“…This indicates that the gradient ∂ û_i will be scaled down if the error on v_i is high and vice versa, diluting the … [Footnote 3: Due to the computational complexity of backpropagation through time with long sequences, we experimented with replacing the RNNs with transformer encoders or convolutional layers. With similar numbers of parameters and all else being equal, these were not able to match the performance of an RNN-based module.]…”

Section: F Loss Function (mentioning)

confidence: 99%

“…Cinematic audio source separation (CASS) is a relatively new subtask of audio SS, most commonly concerned with extracting the dialogue, music, and effects stems from their mixture. Research traction in this new subtask can be credited to Petermann et al. [3,4] and the Cinematic Sound Demixing track of the Sound Demixing Challenge [5], introduced in 2023. While the setup of the task can be easily generalized from standard SS setups, the nature of cinematic audio poses a unique problem not commonly seen in speech or music SS.…”

Section: Introduction (mentioning)

confidence: 99%

“…We further provide empirical results to demonstrate that the common-encoder setup provides superior results for hard-to-learn stems and allows generalization to previously untrained targets without the need for retraining the entire model. To the best of our knowledge, our proposed method¹ is currently the state of the art on the Divide and Remaster (DnR) dataset [3].…”

Section: Introduction (mentioning)

confidence: 99%

A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation

Watcharasupat, Wu, Ding et al., 2023, IEEE Open J. Signal Process.

Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
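
The abstract mentions a loss motivated by the signal-to-noise ratio and the 1-norm without giving its form. One plausible shape for such a loss is sketched below. The function name `l1_snr_loss_db`, the `eps` stabilizer, and the exact expression are assumptions made for illustration. This should not be read as the authors' definition.

```python
import numpy as np


def l1_snr_loss_db(estimate, target, eps=1e-8):
    """Hypothetical SNR-style loss in dB built on the 1-norm.

    Returns 10 * log10(||estimate - target||_1 / ||target||_1), stabilized by eps.
    Lower is better: the value is the negative of an SNR computed with 1-norms.
    """
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    error_l1 = np.sum(np.abs(estimate - target))
    target_l1 = np.sum(np.abs(target))
    return 10.0 * np.log10((error_l1 + eps) / (target_l1 + eps))


# Toy example: an estimate that is uniformly 10% too quiet.
target = np.ones(100)
estimate = 0.9 * target
loss = l1_snr_loss_db(estimate, target)
print(f"loss = {loss:.2f} dB")
```

In this toy case the error 1-norm is one tenth of the target 1-norm, so the loss is about -10 dB. A closer estimate drives the value further below zero.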

“…As a result, singing has a piecewise constant pitch with rapid pitch shifts and other sorts of variations. Until recently, various research strategies and algorithms have been introduced to improve the separation results in SVS tasks [22, 23]. The deep learning techniques [24, 25, 26, 27] are perhaps the most widely used for SVS.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection

Wang, 2023, Sensors

Singing-voice separation is a separation task that involves a singing voice and musical accompaniment. In this paper, we propose a novel, unsupervised methodology for extracting a singing voice from the background in a musical mixture. This method is a modification of robust principal component analysis (RPCA) that separates a singing voice by using weighting based on gammatone filterbank and vocal activity detection. Although RPCA is a helpful method for separating voices from the music mixture, it fails when one single value, such as drums, is much larger than others (e.g., the accompanying instruments). As a result, the proposed approach takes advantage of varying values between low-rank (background) and sparse matrices (singing voice). Additionally, we propose an expanded RPCA on the cochleagram by utilizing coalescent masking on the gammatone. Finally, we utilize vocal activity detection to enhance the separation outcomes by eliminating the lingering music signal. Evaluation results reveal that the proposed approach provides better separation outcomes than RPCA on the ccMixter and DSD100 datasets.
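
For readers unfamiliar with the baseline, plain (unweighted) RPCA splits a matrix M into a low-rank part L and a sparse part S. The sketch below uses a standard fixed-step ADMM scheme for principal component pursuit. It is not the weighted, gammatone-based variant the abstract proposes. The default values for `lam` and `mu` are common heuristics, and all function names are illustrative.

```python
import numpy as np


def soft_threshold(X, tau):
    """Element-wise shrinkage: the proximal operator of tau * ||X||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def singular_value_threshold(X, tau):
    """Shrink the singular values of X by tau: the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt


def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split a nonzero matrix M into low-rank L and sparse S with M ~ L + S."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam      # heuristic weight
    mu = m * n / (4.0 * np.abs(M).sum()) if mu is None else mu  # heuristic step
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint M = L + S
    for _ in range(max_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S


# Synthetic check: rank-2 background plus a few large sparse spikes.
rng = np.random.default_rng(0)
background = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 20))
spikes = np.zeros((20, 20))
spikes.flat[rng.choice(400, size=20, replace=False)] = 10.0
M = background + spikes
L, S = rpca(M)
print("relative residual:", np.linalg.norm(M - L - S) / np.linalg.norm(M))
```

In the singing-voice setting described above, M would be a magnitude spectrogram or cochleagram, with L modeling the repetitive accompaniment and S the voice.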