2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01049
Music Gesture for Visual Sound Separation

Abstract: We propose to leverage explicit body dynamics as motion cues for visual sound separation in music performances (Figure 1). We show that our new model performs well on both heterogeneous and homogeneous music separation tasks.


Cited by 183 publications (139 citation statements); references 46 publications.
“…Despite the variety of architectures, existing multimodal networks are mostly designed for combining vision and language and, less frequently, audio [39, 40]. For example, refs.…”

Section: Related Work
confidence: 99%
“…Recently, several approaches that solve for the alignment of various modalities [19, 20, 21, 11, 22, 12, 23, 24] have also been suggested. Music Gesture [25] uses a keypoint-based structured representation to explicitly model body and finger dynamics as motion cues for visual sound separation. A few very recent works have also explored the multimodal generation problem.…”

Section: Related Work
confidence: 99%
“…Gao et al. [11] proposed a model to detect each musical instrument in a video clip containing multiple sounds and separate the sound emitted by each instrument. Gan et al. [10] improved time-frequency mask estimation for sound source separation by using a context-aware graph network to extract information from the time series of performer keypoints. In research on human speech separation, a method that predicts complex ratio masks carrying both amplitude and phase information was proposed to extract individual speech from the spectrogram of mixed speech [2], [9].…”

Section: B. Sound Separation
confidence: 99%
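The excerpt above describes time-frequency mask estimation: a network predicts per-source masks, which are applied to the mixture spectrogram to recover each source. The following is a minimal NumPy sketch of soft ratio masking, not the architecture of the cited papers; `apply_ratio_mask` and the toy spectrograms are illustrative names and data, and a real system would predict the magnitude estimates with a network and invert the masked spectrograms with an inverse STFT.

```python
import numpy as np

def apply_ratio_mask(mixture_spec, source_mag_estimates, eps=1e-8):
    """Separate sources by soft time-frequency (ratio) masking.

    mixture_spec: complex STFT of the mixture, shape (freq, time).
    source_mag_estimates: per-source magnitude estimates, each
        shape (freq, time), e.g. produced by a separation network.
    Returns one masked complex spectrogram per source.
    """
    total = sum(source_mag_estimates) + eps  # avoid division by zero
    # Each source's mask is its share of the total estimated energy
    # at every time-frequency bin; masks sum to ~1 per bin.
    return [(mag / total) * mixture_spec for mag in source_mag_estimates]

# Toy example: two sources occupying disjoint frequency bands.
freq_bins, frames = 4, 3
s1 = np.zeros((freq_bins, frames)); s1[:2] = 1.0  # low-band source
s2 = np.zeros((freq_bins, frames)); s2[2:] = 1.0  # high-band source
mixture = (s1 + s2).astype(complex)               # stand-in for a complex STFT

est1, est2 = apply_ratio_mask(mixture, [s1, s2])
```

Because the toy sources do not overlap in frequency, each mask is ~1 in its own band and 0 elsewhere, so the estimates recover the sources and sum back to the mixture; with overlapping real instruments the masks instead split each bin's energy proportionally.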