2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.622

The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

Abstract: Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language, like humans do, is their lack of common sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level conc…

Cited by 1,066 publications (897 citation statements)
References 38 publications
“…Recently, crowd-acted and fine-grained datasets [8,19,4,7] have received more and more favor and attention. These newly collected datasets pose new challenges for action recognition.…”
Section: Related Work (mentioning)
confidence: 99%
“…Both of them use a late fusion strategy. Although these 2D networks perform well on datasets like UCF101 [21] or Kinetics [3], they show much less satisfactory results on datasets that require extensive temporal reasoning [8,13]. In another branch, 3D networks (e.g.…”
Section: Temporal Modeling (mentioning)
confidence: 99%
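The contrast this excerpt draws, per-frame 2D features combined by late fusion versus 3D convolutions that span the time axis, can be illustrated with a minimal PyTorch sketch. The layer sizes, the 8-frame clip, and the 174-way output (the Something-Something V1 label count) are illustrative assumptions, not the architectures of the cited papers.

```python
# Minimal sketch: 2D per-frame features with late fusion vs. a 3D conv over time.
# Layer sizes and shapes are illustrative assumptions, not the cited models.
import torch
import torch.nn as nn

class LateFusion2D(nn.Module):
    """Apply the same 2D conv to every frame, then average (late-fuse) over time."""
    def __init__(self, num_classes=174):  # 174 = Something-Something V1 label count
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, video):                         # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feats = self.frame_encoder(video.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1).mean(dim=1)   # temporal averaging = late fusion
        return self.classifier(feats)

class Conv3DModel(nn.Module):
    """A 3D conv mixes neighboring frames, so frame order can influence the features."""
    def __init__(self, num_classes=174):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, video):                         # video: (B, T, 3, H, W)
        return self.classifier(self.encoder(video.permute(0, 2, 1, 3, 4)))

clip = torch.randn(2, 8, 3, 112, 112)                 # 2 clips of 8 frames each
print(LateFusion2D()(clip).shape, Conv3DModel()(clip).shape)
```

Because late fusion averages over frames, reversing the frame order leaves its prediction unchanged, which is one intuition for why such models struggle on temporally sensitive datasets like Something-Something.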
“…The focus of attention is represented by a heatmap indicating the likelihood of where an action is taking place or where an object is being manipulated in each frame. These attention maps can enhance video representation and improve both action and object recognition, yielding very competitive performance on the Something-Something [11] dataset. We show that the attention maps are intuitive and interpretable, enabling better video understanding and model diagnosis.…”
Section: Introduction (mentioning)
confidence: 99%
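A minimal sketch of the mechanism this excerpt describes, assuming a simple 1x1-convolution scoring head: each spatial location gets an attention logit, the softmax over locations yields a per-frame heatmap, and that heatmap both weights the feature pooling and serves as the interpretable visualization. The channel count and feature-map size are hypothetical.

```python
# Minimal sketch of spatial attention pooling with an interpretable heatmap.
# Shapes and the scoring head are assumptions for illustration, not the cited model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # one attention logit per location

    def forward(self, feats):                                 # feats: (B, C, H, W) per-frame features
        logits = self.score(feats)                            # (B, 1, H, W)
        attn = F.softmax(logits.flatten(2), dim=-1).view_as(logits)  # sums to 1 over H*W
        pooled = (feats * attn).sum(dim=(2, 3))               # attention-weighted pooling -> (B, C)
        return pooled, attn                                   # attn is the heatmap to visualize

feats = torch.randn(2, 64, 14, 14)
pooled, heatmap = AttentionPooling(64)(feats)
print(pooled.shape, heatmap.shape)   # torch.Size([2, 64]) torch.Size([2, 1, 14, 14])
```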
“…• We show that multi-modal self-supervision, applied to both source and unlabelled target data, can be used for …”
Figure 2 caption: Fine-grained action datasets [8,17,26,28,38,42,46,47,50]; x-axis: number of action segments per environment (ape); y-axis: dataset size divided by ape. EPIC-Kitchens [8] offers the largest ape relative to its size.
Section: Introduction (mentioning)
confidence: 99%