2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01049
Music Gesture for Visual Sound Separation

Abstract: We propose to leverage explicit body dynamics as motion cues for visual sound separation in music performances (Figure 1). We show that our new model performs well on both heterogeneous and homogeneous music separation tasks.


Cited by 183 publications (139 citation statements); references 46 publications.
“…Despite the variety of architectures, existing multimodal networks are mostly designed for combining vision and language and, less frequently, audio [39, 40]. For example, refs.…”

Section: Related Work
confidence: 99%
“…Recently, several approaches that solve for the alignment of various modalities [19, 20, 21, 11, 22, 12, 23, 24] have also been suggested. Music Gesture [25] uses a keypoint-based structured representation to explicitly model body and finger dynamics as motion cues for visual sound separation. A few very recent works have also explored the multimodal generation problem.…”

Section: Related Work
confidence: 99%
“…Gao et al. [11] proposed a model to detect each musical instrument in a video clip containing multiple sounds and separate the sound emitted by each instrument. Gan et al. [10] improved time-frequency mask estimation for sound source separation by using a context-aware graph network to extract information from the time series of performer keypoints. In research on human speech separation, a method that predicts complex ratio masks carrying both amplitude and phase information was proposed to extract individual speech from the spectrogram of mixed speech [2], [9].…”

Section: B. Sound Separation
confidence: 99%
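The excerpt above describes time-frequency mask estimation: a network predicts per-source masks, which are applied to the mixture spectrogram to recover each source. The following is a minimal NumPy sketch of soft ratio masking, not the architecture of the cited papers; `apply_ratio_mask` and the toy spectrograms are illustrative names and data, and a real system would predict the magnitude estimates with a network and invert the masked spectrograms with an inverse STFT.

```python
import numpy as np

def apply_ratio_mask(mixture_spec, source_mag_estimates, eps=1e-8):
    """Separate sources by soft time-frequency (ratio) masking.

    mixture_spec: complex STFT of the mixture, shape (freq, time).
    source_mag_estimates: per-source magnitude estimates, each
        shape (freq, time), e.g. produced by a separation network.
    Returns one masked complex spectrogram per source.
    """
    total = sum(source_mag_estimates) + eps  # avoid division by zero
    # Each source's mask is its share of the total estimated energy
    # at every time-frequency bin; masks sum to ~1 per bin.
    return [(mag / total) * mixture_spec for mag in source_mag_estimates]

# Toy example: two sources occupying disjoint frequency bands.
freq_bins, frames = 4, 3
s1 = np.zeros((freq_bins, frames)); s1[:2] = 1.0  # low-band source
s2 = np.zeros((freq_bins, frames)); s2[2:] = 1.0  # high-band source
mixture = (s1 + s2).astype(complex)               # stand-in for a complex STFT

est1, est2 = apply_ratio_mask(mixture, [s1, s2])
```

Because the toy sources do not overlap in frequency, each mask is ~1 in its own band and 0 elsewhere, so the estimates recover the sources and sum back to the mixture; with overlapping real instruments the masks instead split each bin's energy proportionally.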