Proceedings of the Thematic Workshops of ACM Multimedia 2017 (2017)
DOI: 10.1145/3126686.3126723

Deep Cross-Modal Audio-Visual Generation

Abstract: Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite works in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specif…
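The abstract describes cross-modal generation via adversarial training. As a rough, illustrative sketch only (not the authors' published architecture), a conditional GAN for sound-to-image generation can condition an image generator on an audio embedding and score (image, audio-embedding) pairs with a discriminator; all module names, layer sizes, and the 128-dimensional audio embedding below are assumptions.

```python
# Minimal conditional-GAN sketch for sound-to-image generation.
# Illustrative only: layer sizes, names, and the audio-embedding input
# are assumptions, not the architecture from the paper.
import torch
import torch.nn as nn

class SoundToImageGenerator(nn.Module):
    def __init__(self, audio_dim=128, noise_dim=100):
        super().__init__()
        # Project [noise || audio embedding] to a 4x4 feature map, then upsample to 64x64.
        self.net = nn.Sequential(
            nn.ConvTranspose2d(noise_dim + audio_dim, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, noise, audio_emb):
        z = torch.cat([noise, audio_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)  # (batch, 3, 64, 64) images

class ConditionalDiscriminator(nn.Module):
    def __init__(self, audio_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),
        )
        # Score real/fake after concatenating the audio condition with image features.
        self.head = nn.Sequential(nn.Conv2d(512 + audio_dim, 1, 4, 1, 0), nn.Sigmoid())

    def forward(self, image, audio_emb):
        feats = self.conv(image)                                 # (batch, 512, 4, 4)
        cond = audio_emb[:, :, None, None].expand(-1, -1, 4, 4)  # tile condition spatially
        return self.head(torch.cat([feats, cond], dim=1)).view(-1)

# Smoke test with random tensors.
g, d = SoundToImageGenerator(), ConditionalDiscriminator()
noise, audio = torch.randn(2, 100), torch.randn(2, 128)
fake = g(noise, audio)
print(fake.shape, d(fake, audio).shape)  # torch.Size([2, 3, 64, 64]) torch.Size([2])
```

In this kind of setup the discriminator sees the conditioning signal as well as the image, so the generator is pushed to produce images that match the input sound rather than merely look realistic.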

Cited by 197 publications (172 citation statements).
References 29 publications.
“…presents results on MUSIC as a function of the training source: single-source videos (solo) or multi-source videos (solo + duet). Our method consistently outperforms all baselines in separation accuracy, as captured by the SDR and SIR metrics. While the SoP method [52] works well… [table caption: Average audio source separation results on a held-out MUSIC test set]…”
mentioning
confidence: 81%
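The excerpt above reports separation accuracy with the SDR and SIR metrics. As a hedged aside, these BSS-Eval metrics are commonly computed with the mir_eval package; the sketch below uses synthetic placeholder signals and is not tied to the MUSIC evaluation quoted above.

```python
# Illustrative SDR/SIR computation with mir_eval's BSS-Eval implementation.
# The random "sources" here are placeholders, not the MUSIC benchmark data.
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))                     # 2 ground-truth sources, 1 s at 16 kHz
estimated = reference + 0.1 * rng.standard_normal((2, 16000))   # noisy separation estimates

# Returns per-source SDR, SIR, SAR (in dB) and the best source permutation.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print(f"SDR (dB): {sdr}, SIR (dB): {sir}, permutation: {perm}")
```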
“…Generating Sounds from Video. Sound generation methods synthesize a sound track from a visual input [32, 54, 6]. Given both visual input and monaural audio, recent methods generate spatial (binaural or ambisonic) audio [13, 30].…”
Section: Related Work
mentioning
confidence: 99%
“…3 [45] dataset and Oxford-102 [46] dataset. We compare our model with the "two-stage" method, the classifier-based method [11], and text-to-image models [3], [4]. As illustrated in the table,…”
Section: B. Experimental Results on Synthesized Data
mentioning
confidence: 99%
“…Network settings and training details are in the materials. The deer class of CIFAR-10 is removed due to its absence in the ImageNet dataset.…”
mentioning
confidence: 99%