“…This type of guidance has been shown to have a positive impact on source-separation performance. For instance, in [12], the authors use visual guidance to improve source-separation quality. Additionally, in concurrent work [13], the authors explore a similar idea of class conditioning over the joint embedding space but, unlike us, they use an auxiliary network to model the parameters of a GMM for the final source separation, and they take spectrograms as input to the model.…”
Can we perform end-to-end music source separation with a variable number of sources using a deep learning model? This paper presents an extension of the Wave-U-Net [1] model that allows end-to-end monaural source separation with a non-fixed number of sources. Furthermore, we propose multiplicative conditioning with instrument labels at the bottleneck of the Wave-U-Net and show its effect on the separation results. This approach can be further extended to other types of conditioning, such as audio-visual and score-informed source separation.
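The multiplicative conditioning described in the abstract can be sketched in a few lines of NumPy (an illustration, not the authors' implementation): an instrument label selects a learned per-channel scale vector that multiplicatively gates the Wave-U-Net bottleneck feature map. The embedding matrix and all shapes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_instruments = 4   # e.g., one label per target source (assumed)
n_channels = 8      # bottleneck feature channels (illustrative)
n_frames = 16       # bottleneck time resolution (illustrative)

# Hypothetical learned embedding: one scale vector per instrument label.
label_embedding = rng.uniform(0.0, 2.0, size=(n_instruments, n_channels))

def condition_bottleneck(features, label):
    """Gate bottleneck features (channels x frames) with a per-channel
    multiplicative scale selected by the instrument label."""
    scale = label_embedding[label]      # (n_channels,)
    return features * scale[:, None]    # broadcast the scale over frames

features = rng.standard_normal((n_channels, n_frames))
out = condition_bottleneck(features, label=2)
```

Because the gate is purely multiplicative, a single network can be steered toward different sources at inference time just by switching the label, which is what makes a variable number of sources tractable.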
“…The motions of players are often highly correlated with the characteristics of the sound sources [6]. There has been work on modeling such correlations for audio source separation [22]. Besides instrumental players, conductor gesture analysis has also been investigated in audiovisual music performance analysis.…”
Section: Table 1, A Categorization of Existing Research on Audiovisual…
“…Audio source separation in music recordings is a particularly interesting task, where audiovisual matching between the visual events of a performer's actions and their audio rendering can be of great value. Notably, such an approach enables addressing audio-separation tasks that could not be performed in a unimodal fashion (solely analyzing the audio signal), as when considering two or more instances of the same instrument, say, a duet of guitars or violins, as done in the work of Parekh et al. [22]. Knowing whether a musician is playing or not at a particular point in time gives important cues for source allocation.…”
Section: Music Source Separation Using Dynamic Correspondence
“…As an illustration, we detail a model in which it is assumed that the characteristics of a sound event (e.g., a musical note) are highly correlated with the speed of the sound-producing motion [22]. More precisely, the proposed approach extends the popular nonnegative matrix factorization (NMF) framework using visual information about objects' motion.…”
Section: Case Study: Motion-Driven Source Separation in a String Quartet
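For reference, the plain audio-only NMF that [22] extends can be sketched with the standard Lee–Seung multiplicative updates (a toy NumPy example; the visual motion cues of the cited work would enter as additional constraints on the activation matrix H, which are not shown here):

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 1e-9  # guard against division by zero

# V: nonnegative magnitude spectrogram (freq bins x time frames), toy data.
V = rng.uniform(0.0, 1.0, size=(20, 30))
k = 5  # number of spectral components (one or more per source)

W = rng.uniform(0.1, 1.0, size=(20, k))   # spectral templates
H = rng.uniform(0.1, 1.0, size=(k, 30))   # time-varying activations

def frob_error(V, W, H):
    return np.linalg.norm(V - W @ H)

err_before = frob_error(V, W, H)
for _ in range(100):
    # Lee-Seung multiplicative updates for the Frobenius objective;
    # they keep W and H nonnegative by construction.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_after = frob_error(V, W, H)
```

Because the updates are multiplicative, side information such as a motion-derived mask on H can be folded in without breaking nonnegativity, which is the structural hook the audiovisual extension relies on.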
“…More details can be found in [22], but for most situations, this joint audiovisual approach significantly outperformed the corresponding sequential approach proposed by the same authors and the audio-only approach introduced in [29]. For example, for a subset of the University of Rochester Multimodal Music Performance data set [7], the joint approach obtained a signal-to-distortion ratio of 7.14 dB for duets and 5.14 dB for trios, while the unimodal approach of [29] obtained signal-to-distortion ratios of 5.11 dB and 2.18 dB, respectively.…”
Section: Case Study: Motion-driven Source Separation In a String Quartetmentioning
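The signal-to-distortion ratios quoted above follow the BSS Eval convention; a simplified version, which treats the entire residual as distortion and skips BSS Eval's interference/artifact decomposition, can be sketched as:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in dB: energy of the
    reference over energy of the residual. BSS Eval's full split into
    interference and artifact terms is deliberately omitted."""
    residual = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))

# A 20 dB example: residual energy is 1/100 of the reference energy.
s = np.ones(1000)
s_hat = s + 0.1          # constant error of 0.1 per sample
# sdr_db(s, s_hat) is approximately 20.0 dB
```

Higher values mean less distortion, so the ~2–3 dB gaps reported above between the joint audiovisual approach and the audio-only baseline are substantial.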