2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00182

The Sound of Motions

Abstract: Sounds originate from object motions and the vibrations of surrounding air. Inspired by the fact that humans are capable of interpreting sound sources from how objects move visually, we propose a novel system that explicitly captures such motion cues for the task of sound localization and separation. Our system is composed of an end-to-end learnable model called Deep Dense Trajectory (DDT) and a curriculum learning scheme. It exploits the inherent coherence of audio-visual signals from large quantities of unlabe…
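
The abstract describes a system that conditions sound separation on visual motion cues. The following minimal PyTorch sketch illustrates that general idea only: a small motion encoder summarizes an optical-flow stack into an embedding, which gates a spectrogram network that predicts a time-frequency separation mask. All module names, layer sizes, and the gating scheme here are illustrative assumptions, not the paper's actual DDT model or its curriculum learning scheme.

import torch
import torch.nn as nn

# Hypothetical sketch of motion-conditioned sound separation.
# Shapes and modules are illustrative, not the paper's DDT network.
class MotionConditionedSeparator(nn.Module):
    def __init__(self, motion_dim=128):
        super().__init__()
        # Summarize a 2-channel optical-flow stack into one embedding per clip.
        self.motion_encoder = nn.Sequential(
            nn.Conv3d(2, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, motion_dim),
        )
        # Map the mixture spectrogram to motion_dim feature channels.
        self.audio_net = nn.Conv2d(1, motion_dim, kernel_size=3, padding=1)
        # Predict a sigmoid mask over time-frequency bins.
        self.mask_head = nn.Conv2d(motion_dim, 1, kernel_size=1)

    def forward(self, mix_spec, flow):
        # mix_spec: (B, 1, F, T) magnitude spectrogram of the audio mixture
        # flow:     (B, 2, T', H, W) optical-flow stack for one sound source
        m = self.motion_encoder(flow)                # (B, motion_dim)
        a = self.audio_net(mix_spec)                 # (B, motion_dim, F, T)
        gated = a * m[:, :, None, None]              # condition audio on motion
        mask = torch.sigmoid(self.mask_head(gated))  # (B, 1, F, T)
        return mask * mix_spec                       # separated spectrogram

# Usage with random tensors, just to show the expected shapes:
# model = MotionConditionedSeparator()
# out = model(torch.rand(4, 1, 256, 100), torch.randn(4, 2, 16, 56, 56))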

Cited by 243 publications (254 citation statements)
References 54 publications

“…Most recently, Zhao et al. [56] proposed to leverage temporal motion information to improve visual sound separation. However, this algorithm has not yet seen wide applicability to sound separation on real mixtures.…”
Section: Related Work (mentioning; confidence: 99%)
“…We perform experiments on three video music performance datasets, namely MUSIC-21 [56], URMP [31] and AtinPiano [36]. MUSIC-21 is an untrimmed video dataset crawled by keyword query from YouTube.…”
Section: Dataset (mentioning; confidence: 99%)
“…[18,26] also explored how to generate spatial sound for videos. More recently, [40,39,13,17,32] used the visual-audio correspondence to separate sound sources. In contrast to previous work that has only transferred class-level information between modalities, this work transfers richer, region-level location information about objects.…”
Section: Cross-modal Self-supervised Learning (mentioning; confidence: 99%)
“…For example, the associations between speech and facial movements can be used for creating facial animations from speech [31,55], generating high-quality talking faces from audio [54,30], separating mixed speech signals of multiple speakers [14,42], and even lip-reading from raw videos [12]. Zhao et al. [63] and Zhou et al. [68] have demonstrated the use of optical-flow-like motion representations to improve the quality of visual sound separation and sound generation. There are also some recent works that explore the correlations between body motion and sound by predicting gestures from speech [22], body dynamics from music [50], or identifying a melody through body language [15].…”
Section: Motion and Sound (mentioning; confidence: 99%)