2018
DOI: 10.1007/978-3-030-01219-9_3

Learning to Separate Object Sounds by Watching Unlabeled Video

Abstract: Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing them in isolation.
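
As a rough illustration of the frequency-basis idea in the abstract (a minimal sketch, not the authors' released implementation), the snippet below factorizes a mixture spectrogram with NMF and reconstructs the audio attributed to a subset of bases. The basis-to-object assignment that the paper learns with multi-instance multi-label learning is stubbed out as the hypothetical `object_basis_idx`, and `mixture.wav` is a placeholder input.

```python
# Minimal NMF-based separation sketch; `object_basis_idx` and "mixture.wav"
# are hypothetical stand-ins, not artifacts from the paper.
import numpy as np
import librosa
from sklearn.decomposition import NMF

audio, sr = librosa.load("mixture.wav", sr=16000)  # placeholder input clip
mixture_stft = librosa.stft(audio, n_fft=1024)
spec = np.abs(mixture_stft)                        # magnitude spectrogram, F x T

# Factorize spec ~= W @ H: W holds frequency bases, H their activations.
nmf = NMF(n_components=25, init="random", max_iter=400, random_state=0)
W = nmf.fit_transform(spec)   # F x K frequency bases
H = nmf.components_           # K x T activations

# Pretend bases 0, 3, 7 were assigned to one visual object by the learned model.
object_basis_idx = [0, 3, 7]
obj_mag = W[:, object_basis_idx] @ H[object_basis_idx, :]

# Soft-mask the mixture and invert using the mixture phase.
mask = obj_mag / (W @ H + 1e-8)
obj_audio = librosa.istft(mask * mixture_stft)
```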

Cited by 249 publications (221 citation statements)
References 84 publications

“…The most related works are [28] and [10]. In [10], a convolutional network predicts the types of objects appearing in the video, and Non-negative Matrix Factorization [9] extracts a set of basic components. The association between each object and each basic component is then estimated via a Multi-Instance Multi-Label objective.…”
Section: Related Work (mentioning)
confidence: 99%
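
To make the object-to-basis association described above concrete, here is a minimal MIML-style head in PyTorch, assuming each video comes with K basis feature vectors (the instances) and weak video-level object labels (the bag labels). The class name `MIMLHead` and all dimensions are hypothetical, not the cited paper's architecture; after training, the per-instance scores before the max pooling indicate which bases associate with which object classes.

```python
# Hedged MIML-style sketch: associate NMF bases (instances) with object
# labels (bag-level, weak supervision). Names and sizes are hypothetical.
import torch
import torch.nn as nn

class MIMLHead(nn.Module):
    """Scores each audio basis against each object class, then max-pools
    over the bag so a video-level multi-label loss can be applied."""
    def __init__(self, basis_dim: int, num_classes: int):
        super().__init__()
        self.scorer = nn.Linear(basis_dim, num_classes)

    def forward(self, bases: torch.Tensor) -> torch.Tensor:
        # bases: (batch, K, basis_dim) -- K basis features per video
        inst_scores = self.scorer(bases)       # (batch, K, num_classes)
        return inst_scores.max(dim=1).values   # video-level class scores

# Toy usage: 8 videos, 25 bases each, 64-d basis features, 15 classes.
model = MIMLHead(basis_dim=64, num_classes=15)
bases = torch.randn(8, 25, 64)
labels = torch.randint(0, 2, (8, 15)).float()  # weak video-level labels
loss = nn.BCEWithLogitsLoss()(model(bases), labels)
loss.backward()
```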
“…This results in a feature tensor of size T × (H/16) × (W/16) × k. In both training and testing, this feature tensor is reduced to a vector representing the visual content by max pooling along the first three dimensions. On top of this solo-video collection, we then follow the Mix-and-Separate strategy of [28, 10] to construct the mixed video/sound data, where each sample mixes n videos and is called a mix-n sample.…”
Section: Training and Testing Details (mentioning)
confidence: 99%
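
A brief sketch of the two steps quoted above, under assumed toy shapes (T=16, H/16=W/16=14, k=512; none of these are taken from the paper): max pooling the visual feature tensor into a k-vector, and building a mix-n training sample by combining solo waveforms. Averaging the mixture to avoid clipping is one common convention; the quoted work may simply sum.

```python
# Sketch of the two steps described above, with hypothetical shapes.
import torch

# (1) Reduce a T x (H/16) x (W/16) x k visual feature tensor to a k-vector
#     by max pooling over the first three dimensions.
feats = torch.randn(16, 14, 14, 512)    # T=16, H/16=W/16=14, k=512 (toy)
visual_vec = feats.amax(dim=(0, 1, 2))  # shape: (512,)

# (2) Mix-and-Separate: build a mix-n sample by combining the waveforms of
#     n solo videos; the solos serve as ground-truth separation targets.
def make_mix_n(waveforms: list[torch.Tensor]) -> tuple[torch.Tensor, list[torch.Tensor]]:
    n = len(waveforms)
    mixture = torch.stack(waveforms).sum(dim=0) / n  # average to avoid clipping
    return mixture, waveforms                        # model input and targets

solos = [torch.randn(16000) for _ in range(2)]  # two toy 1-second clips
mixture, targets = make_mix_n(solos)
```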
“…A related topic is generating speech by measuring vibrations in a video [14]. Follow-up works include separating an input audio signal into a set of components corresponding to different objects in the given video [20], and separating the audio corresponding to each pixel [46].…”
Section: Related Work (mentioning)
confidence: 99%
“…Xu et al (2017) employ AudioSet for weakly supervised audio event detection, whereas Jansen et al (2017) extract semantic representations from non-speech audio following an unsupervised approach. Since the dataset is intended for general-purpose audio event classification, it is suitable for a variety of audio-related problems, such as music or video processing (Gao et al, 2018; Zhou et al, 2018).…”
Section: AudioSet (mentioning)
confidence: 99%