“…By integrating the audio and visual information in multimodal scenes, one can expect to extract richer scene information and overcome the limited perception of a single modality. Recently, several works have used the audio and visual modalities to facilitate multimodal scene understanding from different perspectives, such as sound source localization [23,31,34,37,48] and separation [10,13,41,59,61,63], audio inpainting [62], event localization [4,43,64], action recognition [14], video parsing [42,47], captioning [24,40,50], and dialog [1,66].…”