2020
DOI: 10.48550/arxiv.2001.08740
Preprint

Audiovisual SlowFast Networks for Video Recognition

Abstract: We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce D…
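The multi-layer fusion described in the abstract can be pictured as a lateral connection that injects audio features into the visual pathways at several network stages. Below is a minimal sketch, assuming a PyTorch-style setup; the module name `AudioToVisualFusion`, the tensor shapes, the channel sizes, and the concatenate-then-convolve fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioToVisualFusion(nn.Module):
    """Sketch of one audio-to-visual lateral fusion stage (assumed design)."""

    def __init__(self, audio_channels: int, visual_channels: int):
        super().__init__()
        # Project audio channels so they can be concatenated with visual features.
        self.proj = nn.Conv3d(audio_channels, visual_channels, kernel_size=1)
        # Fuse the concatenated audiovisual features back to the visual width.
        self.fuse = nn.Conv3d(2 * visual_channels, visual_channels, kernel_size=1)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, Cv, T, H, W) clip features; audio: (B, Ca, Ta, F) spectrogram features.
        b, cv, t, h, w = visual.shape
        # Pool audio time/frequency down to the visual temporal length.
        audio = nn.functional.adaptive_avg_pool2d(audio, (t, 1))   # (B, Ca, T, 1)
        audio = audio.unsqueeze(-1)                                 # (B, Ca, T, 1, 1)
        # Project and broadcast spatially so audio aligns with the visual feature map.
        audio = self.proj(audio).expand(-1, -1, -1, h, w)           # (B, Cv, T, H, W)
        return self.fuse(torch.cat([visual, audio], dim=1))

# Example with assumed shapes: fuse spectrogram features into one visual stage.
fusion = AudioToVisualFusion(audio_channels=128, visual_channels=256)
v = torch.randn(2, 256, 8, 14, 14)   # visual features for 2 clips
a = torch.randn(2, 128, 100, 40)     # log-mel-style audio features
out = fusion(v, a)                   # (2, 256, 8, 14, 14)
```

Repeating such a fusion at multiple stages is one way to let audio shape the visual feature hierarchy, which is the idea the abstract attributes to the architecture.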

Cited by 50 publications (61 citation statements)
References 81 publications (136 reference statements)
“…Especially, with similar backbone and pretraining settings, our LSTC can outperform the SlowFast [7], LFB [30] and C-RCNN [31] counterparts with marginal computation cost (SlowFast processes 28.6 clips per second while ours achieves a processing rate of 27.5 clips per second). It is also noticeable that the performance of our model with solely short-term context (25.6 on v2.1 and 26.1 on v2.2) is still better than SlowFast [7] and AVSlowFast [32], and comparable to LFB [30]. These comparisons demonstrate that our LSTC is well suited for atomic action detection.…”
Section: Comparison With State-of-the-art Methods
confidence: 79%
“…In Table 4 and Table 5, we list the comparison results on the standard AVA benchmarks with other methods. It can be observed that on both versions of the AVA validation set, our method outperforms most of the other methods (reported mAP@0.5: SlowFast [7] Res50 with Kinetics-400 pretraining, 24.9; SlowFast [7] Res101-NL with Kinetics-600 pretraining, 29.2; AVSF [32] Res50 with Kinetics-400 pretraining, 25.9). Especially, with similar backbone and pretraining settings, our LSTC can outperform the SlowFast [7], LFB [30] and C-RCNN [31] counterparts with marginal computation cost (SlowFast processes 28.6 clips per second while ours achieves a processing rate of 27.5 clips per second).…”
Section: Comparison With State-of-the-art Methods
confidence: 89%
“…Multi-modal learning. The idea of learning from more than one modality can be seen as an integral part of machine learning research, comprising, among others, areas such as vision-language learning [34,45], zero-shot learning [20,27], as well as vision-audio learning [10,38,41]. Video naturally combines multiple modalities, while at the same time allowing learning from large-scale data that could not be annotated in a reasonable time.
Section: Related Work
confidence: 99%
“…In this case, random modality dropout could be used during training, as done, e.g., in AVSlowFast [41] or Perceiver [21].…”
Section: Multi-modal Fusion Transformer
confidence: 99%
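The random modality dropout mentioned in the statement above amounts to occasionally zeroing the audio input during training so the joint model cannot over-rely on the faster-learning audio signal. The snippet below is a generic illustration under assumed shapes, not the AVSlowFast or Perceiver implementation; the function name `drop_audio_pathway` and the probability `p` are hypothetical.

```python
import torch

def drop_audio_pathway(audio_feat: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    """Randomly zero out the audio features for a whole clip during training.

    With probability p the audio pathway contributes nothing for this step,
    which forces the visual pathways to remain predictive on their own.
    """
    if training and torch.rand(()).item() < p:
        return torch.zeros_like(audio_feat)
    return audio_feat

# Example: audio features from a spectrogram encoder, dropped with p = 0.5.
audio = torch.randn(4, 128, 100, 40)
audio = drop_audio_pathway(audio, p=0.5, training=True)
```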
“…In the latter, the sounds are taken from internet video and thus contain a much wider range of auditory events than what we consider in this work. Later work simultaneously learned audio and visual representations [26,27,28,29,30,31]. Other work has addressed cross-modal distillation [32], sound source localization [33,34,35,27,36,37,38,39,40], active speaker detection [41,42,43], and source separation [44,45,46,47].…”
Section: Static Motion
confidence: 99%