Leveraging temporal synchronization and association between sight and sound is an essential step towards robust localization of sounding objects. To this end, we propose a space-time memory network for sounding object localization in videos. The network simultaneously learns spatio-temporal attention over both uni-modal and cross-modal representations from the audio and visual modalities. We analyze, both quantitatively and qualitatively, the effectiveness of incorporating spatio-temporal learning for localizing audio-visual objects. We demonstrate that our approach generalizes to a variety of complex audio-visual scenes and outperforms recent state-of-the-art methods. Code and data can be found at https://sites.google.com/view/bmvc2021stm.
Introduction

Neurological evidence suggests that human understanding of scenes relies predominantly on the integration of visual and auditory cues [3]. As humans, we direct attention to sounding sources by leveraging the temporal, cross-modal alignments between vision and sound, where understanding of the past tells us where and what to attend to next. For computational models, although several sound source spatial localization frameworks have been developed [21,22,27], how much is gained from explicitly leveraging the temporal correspondence that exists naturally between video and audio remains unexplored.

Temporal coherence, however, must be taken into account to facilitate consistent understanding of complex scenes. Imagine a person playing a guitar in front of a wall of unused guitars. To determine which guitar is sounding and to obtain stable localization results, we must take multiple timesteps into account. It is therefore worthwhile to explore whether learning temporal cues can benefit the localization of sounding objects in videos; a toy sketch of this temporal intuition appears at the end of this section.

To localize the visual objects associated with specific sound sources, most previous works capture audio-visual spatial correspondence via similarities between audio and visual modalities [2,15,21], cross-modal attention mechanisms [25,27], or sounding class activation mapping [22]; the first sketch below illustrates the similarity-based variant. Nevertheless, these methods typically localize sounding objects in static images, and the audio-visual temporal coherence available in videos is commonly ignored.
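As a concrete illustration of the similarity-based approach mentioned above, the following sketch computes a localization heatmap as the cosine similarity between a clip-level audio embedding and every spatial location of a visual feature map. This is a minimal PyTorch example, not the implementation of any cited method; the tensor shapes, feature dimension, and function name are our own illustrative assumptions.

```python
# Minimal sketch (not the paper's method): similarity-based sound source
# localization. All shapes and names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def similarity_heatmap(visual_feat: torch.Tensor,
                       audio_feat: torch.Tensor) -> torch.Tensor:
    """visual_feat: (B, C, H, W) frame features; audio_feat: (B, C) clip embedding.

    Returns a (B, H, W) cosine-similarity heatmap in [-1, 1].
    """
    v = F.normalize(visual_feat, dim=1)   # unit-normalize along channels
    a = F.normalize(audio_feat, dim=1)    # unit-normalize the embedding
    # Dot product over channels at every spatial location = cosine similarity.
    return torch.einsum('bchw,bc->bhw', v, a)

# Hypothetical usage: 512-d features on a 14x14 grid for a batch of 2 frames.
vis = torch.randn(2, 512, 14, 14)
aud = torch.randn(2, 512)
print(similarity_heatmap(vis, aud).shape)  # torch.Size([2, 14, 14])
```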
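Building on the per-frame heatmap above, a second sketch illustrates the temporal intuition from the guitar example: aggregating evidence across timesteps can suppress visually plausible but silent objects. The softmax weighting here is a crude, hand-crafted stand-in assumed purely for illustration; it is not the learned space-time memory attention proposed in this paper.

```python
# Toy temporal aggregation (illustrative only, not the proposed memory network):
# a single frame's heatmap can fire on any plausible object, but weighting
# frames by their peak audio-visual response down-weights spurious detections.
import torch

def temporal_heatmap(frame_heatmaps: torch.Tensor) -> torch.Tensor:
    """frame_heatmaps: (T, H, W) per-frame similarity maps over a clip.

    Returns an (H, W) map favoring regions consistently aligned with the audio.
    """
    peak = frame_heatmaps.flatten(1).max(dim=1).values  # (T,) peak response/frame
    weights = torch.softmax(peak, dim=0)                # (T,) temporal weights
    # Weighted temporal average of the per-frame maps.
    return torch.einsum('t,thw->hw', weights, frame_heatmaps)

# Hypothetical usage: aggregate 8 frames of 14x14 heatmaps.
maps = torch.randn(8, 14, 14)
print(temporal_heatmap(maps).shape)  # torch.Size([14, 14])
```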