2021 IEEE International Conference on Multimedia and Expo (ICME)
DOI: 10.1109/icme51207.2021.9428373
MPN: Multimodal Parallel Network for Audio-Visual Event Localization

Cited by 15 publications (4 citation statements)
References 14 publications
“…2) To achieve better performance, Global-Local [16] samples video frames at 10 FPS on the AVE [5] dataset for data augmentation. In our work, to keep the experimental setup consistent, we keep 1 FPS in our method, the same as the existing literature [5]-[11], [74]. 3) The authors of Global-Local [16] suggest using 16 Tesla P100 GPUs to handle their large-scale dataset of 240k videos, while our model can be trained lightly with just one GTX 1080 GPU and no extra data.…”
Section: B2 More Discussion on the Comparison to Self-supervised Methods
Mentioning confidence: 99%
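The 1 FPS sampling setup quoted above is simple to reproduce. Below is a minimal frame-extraction sketch, assuming OpenCV; the helper name and the 30 FPS fallback are illustrative and do not come from any of the cited papers.

```python
import cv2

def sample_frames_1fps(video_path):
    """Extract roughly one frame per second of video (hypothetical helper)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # metadata may be missing; assume 30
    step = max(int(round(fps)), 1)           # native frames between samples
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                  # keep the first frame of each second
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```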
“…Yu et al. (2021) applied the self-attention module and cross-modal attention module to provide precise event localization results. H. Chen et al.…”
Section: Related Work
Mentioning confidence: 99%
“…Wu et al. (2019) proposed dual attention matching, which used each modality's features as guidance for the other and outperformed other state-of-the-art methods on the localization task. Yu et al. (2021) applied the self-attention module and cross-modal attention module to provide precise event localization results. H. Chen et al. (2021) evaluated the localization of objects with annotated bounding boxes by calculating the cosine similarity between the visual and auditory features.…”
Section: Multi-modal Learning
Mentioning confidence: 99%