2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.622

The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

Abstract: Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language, like humans do, is their lack of common sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level conc…

Cited by 1,066 publications (897 citation statements)
References 38 publications
“…Recently, crowd-acted and fine-grained datasets [8,19,4,7] have received more and more favor and attention. These newly collected datasets pose new challenges for action recognition.…”
Section: Related Work (mentioning)
confidence: 99%
“…Both of them use a late fusion strategy. Although these 2D networks perform well on datasets like UCF101 [21] or Kinetics [3], they show much less satisfactory results on datasets that require extensive temporal reasoning [8,13]. In another branch, 3D networks (e.g.…”
Section: Temporal Modeling (mentioning)
confidence: 99%
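The contrast this excerpt draws, per-frame 2D features combined by late fusion versus 3D convolutions that span the time axis, can be illustrated with a minimal PyTorch sketch. The layer sizes, the 8-frame clip, and the 174-way output (the Something-Something V1 label count) are illustrative assumptions, not the architectures of the cited papers.

```python
# Minimal sketch: 2D per-frame features with late fusion vs. a 3D conv over time.
# Layer sizes and shapes are illustrative assumptions, not the cited models.
import torch
import torch.nn as nn

class LateFusion2D(nn.Module):
    """Apply the same 2D conv to every frame, then average (late-fuse) over time."""
    def __init__(self, num_classes=174):  # 174 = Something-Something V1 label count
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, video):                         # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feats = self.frame_encoder(video.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1).mean(dim=1)   # temporal averaging = late fusion
        return self.classifier(feats)

class Conv3DModel(nn.Module):
    """A 3D conv mixes neighboring frames, so frame order can influence the features."""
    def __init__(self, num_classes=174):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, video):                         # video: (B, T, 3, H, W)
        return self.classifier(self.encoder(video.permute(0, 2, 1, 3, 4)))

clip = torch.randn(2, 8, 3, 112, 112)                 # 2 clips of 8 frames each
print(LateFusion2D()(clip).shape, Conv3DModel()(clip).shape)
```

Because late fusion averages over frames, reversing the frame order leaves its prediction unchanged, which is one intuition for why such models struggle on temporally sensitive datasets like Something-Something.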
“…The focus of attention is represented by a heatmap indicating the likelihood of where an action is taking place or where an object is being manipulated in each frame. These attention maps can enhance video representation and improve both action and object recognition, yielding very competitive performance on the Something-Something [11] dataset. We show that the attention maps are intuitive and interpretable, enabling better video understanding and model diagnosis.…”
Section: Introduction (mentioning)
confidence: 99%
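A minimal sketch of the mechanism this excerpt describes, assuming a simple 1x1-convolution scoring head: each spatial location gets an attention logit, the softmax over locations yields a per-frame heatmap, and that heatmap both weights the feature pooling and serves as the interpretable visualization. The channel count and feature-map size are hypothetical.

```python
# Minimal sketch of spatial attention pooling with an interpretable heatmap.
# Shapes and the scoring head are assumptions for illustration, not the cited model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # one attention logit per location

    def forward(self, feats):                                 # feats: (B, C, H, W) per-frame features
        logits = self.score(feats)                            # (B, 1, H, W)
        attn = F.softmax(logits.flatten(2), dim=-1).view_as(logits)  # sums to 1 over H*W
        pooled = (feats * attn).sum(dim=(2, 3))               # attention-weighted pooling -> (B, C)
        return pooled, attn                                   # attn is the heatmap to visualize

feats = torch.randn(2, 64, 14, 14)
pooled, heatmap = AttentionPooling(64)(feats)
print(pooled.shape, heatmap.shape)   # torch.Size([2, 64]) torch.Size([2, 1, 14, 14])
```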
“…• We show that multi-modal self-supervision, applied to both source and unlabelled target data, can be used for …”
Figure 2 caption: Fine-grained action datasets [8,17,26,28,38,42,46,47,50]; x-axis: number of action segments per environment (ape); y-axis: dataset size divided by ape. EPIC-Kitchens [8] offers the largest ape relative to its size.
Section: Introduction (mentioning)
confidence: 99%