Proceedings of the Thematic Workshops of ACM Multimedia 2017 (2017)
DOI: 10.1145/3126686.3126723

Deep Cross-Modal Audio-Visual Generation

Abstract: Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite works in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specif…
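The abstract describes cross-modal generation via adversarial training. As a rough, illustrative sketch only (not the authors' published architecture), a conditional GAN for sound-to-image generation can condition an image generator on an audio embedding and score (image, audio-embedding) pairs with a discriminator; all module names, layer sizes, and the 128-dimensional audio embedding below are assumptions.

```python
# Minimal conditional-GAN sketch for sound-to-image generation.
# Illustrative only: layer sizes, names, and the audio-embedding input
# are assumptions, not the architecture from the paper.
import torch
import torch.nn as nn

class SoundToImageGenerator(nn.Module):
    def __init__(self, audio_dim=128, noise_dim=100):
        super().__init__()
        # Project [noise || audio embedding] to a 4x4 feature map, then upsample to 64x64.
        self.net = nn.Sequential(
            nn.ConvTranspose2d(noise_dim + audio_dim, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, noise, audio_emb):
        z = torch.cat([noise, audio_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)  # (batch, 3, 64, 64) images

class ConditionalDiscriminator(nn.Module):
    def __init__(self, audio_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),
        )
        # Score real/fake after concatenating the audio condition with image features.
        self.head = nn.Sequential(nn.Conv2d(512 + audio_dim, 1, 4, 1, 0), nn.Sigmoid())

    def forward(self, image, audio_emb):
        feats = self.conv(image)                                 # (batch, 512, 4, 4)
        cond = audio_emb[:, :, None, None].expand(-1, -1, 4, 4)  # tile condition spatially
        return self.head(torch.cat([feats, cond], dim=1)).view(-1)

# Smoke test with random tensors.
g, d = SoundToImageGenerator(), ConditionalDiscriminator()
noise, audio = torch.randn(2, 100), torch.randn(2, 128)
fake = g(noise, audio)
print(fake.shape, d(fake, audio).shape)  # torch.Size([2, 3, 64, 64]) torch.Size([2])
```

In this kind of setup the discriminator sees the conditioning signal as well as the image, so the generator is pushed to produce images that match the input sound rather than merely look realistic.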

Cited by 197 publications (172 citation statements).
References 29 publications.
“…presents results on MUSIC as a function of the training source: single-source videos (solo) or multi-source videos (solo + duet). Our method consistently outperforms all baselines in separation accuracy, as captured by the SDR and SIR metrics. While the SoP method [52] works well… [table caption: Average audio source separation results on a held-out MUSIC test set]…”
mentioning
confidence: 81%
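The excerpt above reports separation accuracy with the SDR and SIR metrics. As a hedged aside, these BSS-Eval metrics are commonly computed with the mir_eval package; the sketch below uses synthetic placeholder signals and is not tied to the MUSIC evaluation quoted above.

```python
# Illustrative SDR/SIR computation with mir_eval's BSS-Eval implementation.
# The random "sources" here are placeholders, not the MUSIC benchmark data.
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))                     # 2 ground-truth sources, 1 s at 16 kHz
estimated = reference + 0.1 * rng.standard_normal((2, 16000))   # noisy separation estimates

# Returns per-source SDR, SIR, SAR (in dB) and the best source permutation.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print(f"SDR (dB): {sdr}, SIR (dB): {sir}, permutation: {perm}")
```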
“…Generating Sounds from Video. Sound generation methods synthesize a sound track from a visual input [32, 54, 6]. Given both visual input and monaural audio, recent methods generate spatial (binaural or ambisonic) audio [13, 30].…”
Section: Related Work
mentioning
confidence: 99%
“…3 [45] dataset and Oxford-102 [46] dataset. We compare our model with the "two-stage" method, the classifier-based method [11], and text-to-image models [3], [4]. As illustrated in the table,…”
Section: B. Experimental Results on Synthesized Data
mentioning
confidence: 99%
“…Network settings and training details are in the materials. The deer class of CIFAR-10 is removed due to its absence in the ImageNet dataset.…”
mentioning
confidence: 99%