Audiovisual SlowFast Networks for Video Recognition

Xiao, Fanyi; Lee, Yong Jae; Grauman, Kristen; Malik, Jitendra; Feichtenhofer, Christoph

doi:10.48550/arxiv.2001.08740

Cited by 50 publications

(61 citation statements)

References 81 publications

(136 reference statements)


“…Especially, with similar backbone and pretraining settings, our LSTC can outperform the SlowFast [7], LFB [30] and C-RCNN [31] counterpart with marginal computation cost (SlowFast processes 28.6 clips per second while ours achieves a processing rate of 27.5 clips in each second). It is also noticeable that the performance of our model with solely short-term context (25.6 on v2.1 and 26.1 on v2.2) is still better than SlowFast [7], AVSlowFast [32] and comparable to LFB [30]. These comparison demonstrate that our LSTC is well suitable for atomic action detection.…”

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 79%

“…In Table 4 and Table 5, we list the comparison results on the standard AVA benchmarks with other methods. It can be observed that on both version of AVA validation set, our method outperforms most of other methods. Especially, with similar backbone and pretraining settings, our LSTC can outperform the SlowFast [7], LFB [30] and C-RCNN [31] counterpart with marginal computation cost (SlowFast processes 28.6 clips per second while ours achieves a processing rate of 27.5 clips in each second).…”

[Table fragment embedded in the excerpt; the mAP@0.5 value of the last row is missing in the source]
method         backbone    pretrain       mAP@0.5
SlowFast [7]   Res50       Kinetics-400   24.9
SlowFast [7]   Res101-NL   Kinetics-600   29.2
AVSF [32]      Res50       Kinetics-400   25.9
AVSF [32]      Res101-NL   Kinetics-400

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 89%


LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

Zhang et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia


In this paper, we place the atomic action detection problem into a Long-Short Term Context (LSTC) to analyze how the temporal reliance among video signals affect the action detection results. To do this, we decompose the action recognition pipeline into short-term and long-term reliance, in terms of the hypothesis that the two kinds of context are conditionally independent given the objective action instance. Within our design, a local aggregation branch is utilized to gather dense and informative short-term cues, while a high order long-term inference branch is designed to reason the objective action class from high-order interaction between actor and other person or person pairs. Both branches independently predict the context-specific actions and the results are merged in the end. We demonstrate that both temporal grains are beneficial to atomic action recognition. On the mainstream benchmarks of atomic action detection, our design can bring significant performance gain from the existing state-of-the-art pipeline.

CCS CONCEPTS: Computing methodologies → Activity recognition and understanding.
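The abstract does not say how the two branch predictions are merged. As a hedged illustration only, the stated conditional-independence hypothesis has a standard probabilistic consequence. The symbols below are assumptions introduced here, not notation from the paper: a is the action class, s the short-term context, and l the long-term context.

```latex
% Hypothesis: short-term and long-term context are conditionally
% independent given the action class a.
p(s, l \mid a) = p(s \mid a)\, p(l \mid a)

% Bayes' rule applied to the joint context, then to each factor:
p(a \mid s, l)
  = \frac{p(a)\, p(s \mid a)\, p(l \mid a)}{p(s, l)}
  = \frac{p(s)\, p(l)}{p(s, l)} \cdot \frac{p(a \mid s)\, p(a \mid l)}{p(a)}
  \;\propto\; \frac{p(a \mid s)\, p(a \mid l)}{p(a)}
```

Under this reading, two branches that separately estimate p(a | s) and p(a | l) could be combined by a product of their outputs, corrected by the class prior. Whether LSTC merges its branches this way is not confirmed by the excerpt.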


“…Multi-modal learning. The idea of learning from more than one modality can be seen as an integral part of machine learning research, comprising i.a., areas such as vision-language based learning [34,45], zero-shot learning [20,27], as well as vision-audio learning [10,38,41]. Video naturally combines multiple modalities, while at the same time allowing to learn from large-scale data that would not be annotatable in a reasonable time.…”

Section: Related Work (mentioning)

confidence: 99%

“…In this case random modality dropout could be used during training as e.g. done in AVSlowfast [41] or Perceiver [21].…”

Section: Multi-modal Fusion Transformer (mentioning)

confidence: 99%
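The statement above refers to random modality dropout during training. A minimal generic sketch of the idea follows, assuming per-modality features are held in a dictionary. The function name `drop_modalities`, the `p_drop` parameter, and the keep-at-least-one rule are illustrative assumptions, not the specific scheme used in AVSlowFast or Perceiver.

```python
import random


def drop_modalities(features, p_drop=0.5, rng=None):
    """Return a copy of `features` with each modality dropped with probability p_drop.

    Hypothetical helper for illustration: `features` maps a modality name
    (e.g. "video", "audio") to its feature object. At least one modality is
    always kept so the model never receives an empty input.
    """
    if rng is None:
        rng = random.Random()
    # Keep a modality when its random draw falls at or above the drop probability.
    kept = {name: feat for name, feat in features.items() if rng.random() >= p_drop}
    if not kept:
        # Everything was dropped: fall back to one randomly chosen modality.
        name = rng.choice(sorted(features))
        kept = {name: features[name]}
    return kept


# Usage: applied once per training sample, before the fusion module.
sample = {"video": [0.1, 0.2], "audio": [0.3], "text": [0.4, 0.5]}
train_input = drop_modalities(sample, p_drop=0.5, rng=random.Random(0))
```

Applied only during training, this kind of dropout exposes the fusion module to incomplete inputs, which is one way to prepare a model for missing modalities at test time.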

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Shvetsova, Chen, Rouditchenko et al. 2021

Preprint


Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joined multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.


“…In the latter, the sounds are taken from internet video and thus contain a much wider range of auditory events than what we consider in this work. Later work simultaneously learned audio and visual representations [26,27,28,29,30,31]. Other work has learned cross-modal distillation [32], sound source localization [33,34,35,27,36,37,38,39,40], active speaker detection [41,42,43], source separation [44,45,46,47].…”

Section: Static Motion (mentioning)

confidence: 99%

Structure from Silence: Learning Scene Structure from Ambient Sound

Chen, Hu, Owens 2021

Preprint


https://ificl.github.io/structure-from-silence

[Figure panels: (a) Quiet Campus dataset, (b) Depth estimation, (c) Multimodal self-supervision]

Figure 1: What can ambient sound tell us about 3D scene structure? (a) We collect an "in-the-wild" dataset of paired audio and RGB-D recordings from quiet indoor scenes. (b) Given audio from a scene, we estimate distance to a wall. (c) We use this ambient sound to learn audio-visual representations through self-supervision.
