“…This type of guidance has been shown to have a positive impact on source-separation performance. For instance, in [12], the authors use visual guidance to improve source-separation quality. Additionally, in concurrent work [13], the authors explore a similar idea of class conditioning over the joint embedding space but, unlike us, they use an auxiliary network to model the parameters of a GMM for the final source separation, and they take spectrograms as input to the model.…”
Can we perform end-to-end music source separation with a variable number of sources using a deep learning model? This paper presents an extension of the Wave-U-Net [1] model that allows end-to-end monaural source separation with a non-fixed number of sources. Furthermore, we propose multiplicative conditioning with instrument labels at the bottleneck of the Wave-U-Net and show its effect on the separation results. This approach can be further extended to other types of conditioning, such as audio-visual and score-informed source separation.
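The multiplicative conditioning described in the abstract can be sketched in a few lines of NumPy (an illustration, not the authors' implementation): an instrument label selects a learned per-channel scale vector that multiplicatively gates the Wave-U-Net bottleneck feature map. The embedding matrix and all shapes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_instruments = 4   # e.g., one label per target source (assumed)
n_channels = 8      # bottleneck feature channels (illustrative)
n_frames = 16       # bottleneck time resolution (illustrative)

# Hypothetical learned embedding: one scale vector per instrument label.
label_embedding = rng.uniform(0.0, 2.0, size=(n_instruments, n_channels))

def condition_bottleneck(features, label):
    """Gate bottleneck features (channels x frames) with a per-channel
    multiplicative scale selected by the instrument label."""
    scale = label_embedding[label]      # (n_channels,)
    return features * scale[:, None]    # broadcast the scale over frames

features = rng.standard_normal((n_channels, n_frames))
out = condition_bottleneck(features, label=2)
```

Because the gate is purely multiplicative, a single network can be steered toward different sources at inference time just by switching the label, which is what makes a variable number of sources tractable.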
“…The motions of players are often highly correlated with the characteristics of the sound sources [6]. There has been work on modeling such correlations for audio source separation [22]. Besides instrumental players, conductor gesture analysis has also been investigated in audiovisual music performance analysis.…”
Section: Table 1, A Categorization of Existing Research on Audiovisual…
“…Audio source separation in music recordings is a particularly interesting task, where audiovisual matching between the visual events of a performer's actions and their audio rendering can be of great value. Notably, such an approach enables addressing audio-separation tasks that could not be performed in a unimodal fashion (solely analyzing the audio signal), as when considering two or more instances of the same instrument, say, a duet of guitars or violins, as done in the work of Parekh et al. [22]. Knowing whether a musician is playing or not at a particular point in time gives important cues for source allocation.…”
Section: Music Source Separation Using Dynamic Correspondence
“…As an illustration, we detail a model in which it is assumed that the characteristics of a sound event (e.g., a musical note) are highly correlated with the speed of the sound-producing motion [22]. More precisely, the proposed approach extends the popular nonnegative matrix factorization (NMF) framework using visual information about objects' motion.…”
Section: Case Study: Motion-Driven Source Separation in a String Quartet
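For reference, the plain audio-only NMF that [22] extends can be sketched with the standard Lee–Seung multiplicative updates (a toy NumPy example; the visual motion cues of the cited work would enter as additional constraints on the activation matrix H, which are not shown here):

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 1e-9  # guard against division by zero

# V: nonnegative magnitude spectrogram (freq bins x time frames), toy data.
V = rng.uniform(0.0, 1.0, size=(20, 30))
k = 5  # number of spectral components (one or more per source)

W = rng.uniform(0.1, 1.0, size=(20, k))   # spectral templates
H = rng.uniform(0.1, 1.0, size=(k, 30))   # time-varying activations

def frob_error(V, W, H):
    return np.linalg.norm(V - W @ H)

err_before = frob_error(V, W, H)
for _ in range(100):
    # Lee-Seung multiplicative updates for the Frobenius objective;
    # they keep W and H nonnegative by construction.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_after = frob_error(V, W, H)
```

Because the updates are multiplicative, side information such as a motion-derived mask on H can be folded in without breaking nonnegativity, which is the structural hook the audiovisual extension relies on.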
“…More details can be found in [22], but for most situations, this joint audiovisual approach significantly outperformed the corresponding sequential approach proposed by the same authors and the audio-only approach introduced in [29]. For example, for a subset of the University of Rochester Multimodal Music Performance data set [7], the joint approach obtained a signal-to-distortion ratio of 7.14 dB for duets and 5.14 dB for trios, while the unimodal approach of [29] obtained signal-to-distortion ratios of 5.11 dB and 2.18 dB, respectively.…”
Section: Case Study: Motion-driven Source Separation In a String Quartetmentioning
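The signal-to-distortion ratios quoted above follow the BSS Eval convention; a simplified version, which treats the entire residual as distortion and skips BSS Eval's interference/artifact decomposition, can be sketched as:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in dB: energy of the
    reference over energy of the residual. BSS Eval's full split into
    interference and artifact terms is deliberately omitted."""
    residual = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))

# A 20 dB example: residual energy is 1/100 of the reference energy.
s = np.ones(1000)
s_hat = s + 0.1          # constant error of 0.1 per sample
# sdr_db(s, s_hat) is approximately 20.0 dB
```

Higher values mean less distortion, so the ~2–3 dB gaps reported above between the joint audiovisual approach and the audio-only baseline are substantial.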