Motion informed audio source separation

Parekh, Sanjeel; Essid, Slim; Ozerov, Alexey; Duong, Ngoc Q. K.; Pérez, Patrick; Richard, Gaël

doi:10.1109/icassp.2017.7951787

Cited by 45 publications

(28 citation statements)

References 21 publications


“…Audio-Visual Source Separation Early methods for audio-visual source separation focus on mutual information [10], subspace analysis [42,34], matrix factorization [33,39], and correlated onsets [5,27]. Recent methods leverage deep learning for separating speech [8,31,3,11], musical instruments [52,13,51], and other objects [12].…”

Section: Related Work (mentioning)

confidence: 99%

Co-Separating Sounds of Visual Objects

Gao

Grauman

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

185

231


Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of "true" mixed sounds. We introduce a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos. Our novel training objective requires that the deep neural network's separated audio for similar-looking objects be consistently identifiable, while simultaneously reproducing accurate video-level audio tracks for each source training pair. Our approach disentangles sounds in realistic test videos, even in cases where an object was not observed individually during training. We obtain state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.

…where $M_{V_1}$ and $M_{V_2}$ are the ground-truth spectrogram ratio masks for the two videos, respectively. Namely, …
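The excerpt above refers to ground-truth spectrogram ratio masks for two mixed videos but breaks off before defining them. As a rough illustration only, the sketch below uses one common convention for a ratio mask (each source's magnitude divided by the sum of both magnitudes). This definition, the function name `ratio_masks`, the `eps` stabilizer, and the toy random spectrograms are all assumptions made for illustration; they are not taken from the cited paper.

```python
import numpy as np


def ratio_masks(mag_v1, mag_v2, eps=1e-8):
    """Return ratio masks (m_v1, m_v2) for two magnitude spectrograms.

    Assumed convention (not from the source): each mask is the source
    magnitude divided by the sum of both magnitudes. `eps` avoids
    division by zero in silent time-frequency bins.
    """
    total = mag_v1 + mag_v2 + eps
    return mag_v1 / total, mag_v2 / total


# Toy magnitude spectrograms (frequency bins x time frames); hypothetical data.
rng = np.random.default_rng(0)
mag_v1 = rng.random((4, 5))
mag_v2 = rng.random((4, 5))

m_v1, m_v2 = ratio_masks(mag_v1, mag_v2)

# Simplifying assumption: the mixture magnitude is the sum of the two
# source magnitudes (phase interactions are ignored).
mixture_mag = mag_v1 + mag_v2

# Applying a mask to the mixture gives an estimate of that source.
est_v1 = m_v1 * mixture_mag
```

Under the additive-magnitude assumption the two masks sum to approximately one in every bin, so masking the artificial mixture recovers each source magnitude almost exactly. With real audio, phase interactions make this only an approximation.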


“…Audio-visual source separation The idea of guiding audio source separation using visual information can be traced back to [15,27], where mutual information is used to learn the joint distribution of the visual and auditory signals, then applied to isolate human speakers. Subsequent work explores audio-visual subspace analysis [62,67], NMF informed by visual motion [61,65], statistical convolutive mixture models [64], and correlating temporal onset events [8,52]. Recent work [62] attempts both localization and separation simultaneously; however, it assumes a moving object is present and only aims to decompose a video into background (assumed low-rank) and foreground sounds/pixels.…”

Section: Audio-visual Representation Learning (mentioning)

confidence: 99%

“…Recent work [62] attempts both localization and separation simultaneously; however, it assumes a moving object is present and only aims to decompose a video into background (assumed low-rank) and foreground sounds/pixels. Prior methods nearly always tackle videos of people speaking or playing musical instruments [8,12,15,27,52,61,62,64], domains where salient motion signals accompany audio events (e.g., a mouth or a violin bow starts moving, a guitar string suddenly accelerates). Some studies further assume side cues from a written musical score [52], require that each sound source has a period when it alone is active [12], or use ground-truth motion captured by MoCap [61].…”

Section: Audio-visual Representation Learning (mentioning)

confidence: 99%

“…Prior attempts at visually-aided audio source separation tackle the problem by detecting low-level correlations between the two data streams for the input video [8,12,15,27,52,61,62,64], and they experiment with somewhat controlled domains of musical instruments in concert or human speakers facing the camera. In contrast, we propose to learn object-level sound models from hundreds of thousands of unlabeled videos, and generalize to separate new audio-visual instances.…”

Section: Introduction (mentioning)

confidence: 99%


Learning to Separate Object Sounds by Watching Unlabeled Video

Gao

Feris

Grauman

2018

Computer Vision – ECCV 2018

249

222


Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results:


“…• Visually Informed Source Separation: Audio events (e.g., a violin note) are often associated with visual movements (e.g., a bowing motion) [5]. Designing methods that can leverage visual information for source separation is an interesting task.…”

Section: B New Tasks Using Both Audio and Visual Modalities (mentioning)

confidence: 99%

Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications

Liu

Dinesh

et al. 2019

IEEE Trans. Multimedia

106

110


We introduce a dataset for facilitating audio-visual analysis of music performances. The dataset comprises 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece, we provide the musical score in MIDI format, the audio recordings of the individual tracks, the audio and video recording of the assembled mixture, and ground-truth annotation files including frame-level and note-level transcriptions. We describe our methodology for the creation of the dataset, particularly highlighting our approaches for addressing the challenges involved in maintaining synchronization and expressiveness. We demonstrate the high quality of synchronization achieved with our proposed approach by comparing the dataset with existing widely-used music audio datasets. We anticipate that the dataset will be useful for the development and evaluation of existing music information retrieval (MIR) tasks, as well as for novel multi-modal tasks. We benchmark two existing MIR tasks (multi-pitch analysis and score-informed source separation) on the dataset and compare with other existing music audio datasets. Additionally, we consider two novel multi-modal MIR tasks (visually informed multi-pitch analysis and polyphonic vibrato analysis) enabled by the dataset and provide evaluation measures and baseline systems for future comparisons (from our recent work). Finally, we propose several emerging research directions that the dataset enables.
