The "something something" video database for learning and evaluating visual common sense

Goyal, Raghav; Kahou, Samira Ebrahimi; Michalski, Vincent; Materzyńska, Joanna; Westphal, Susanne; Kim, Heuna; Haenel, Valentin; Fruend, Ingo; Yianilos, P.N.; Mueller-Freitag, Moritz; Hoppe, Florian; Thurau, Christian; Bax, Ingo; Memisevic, Roland

doi:10.48550/arxiv.1706.04261

Cited by 11 publications

(15 citation statements)

References 0 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The task formulation is aimed at achieving an "intuitive" figure-understanding system, that does not resort to inverting the visualization pipeline. This is in line with the recent trend in visual-textual datasets, such as those for intuitive physics and reasoning (Goyal et al, 2017;Mun et al, 2016).…”

Section: Related Worksupporting

confidence: 87%

FigureQA: An Annotated Figure Dataset for Visual Reasoning

Kahou,

Michalski,

Atkinson

et al. 2017

Preprint

Self Cite

View full text Add to dashboard Cite

We introduce FigureQA, a visual reasoning corpus of over one million questionanswer pairs grounded in over 100, 000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. We formulate our reasoning task by generating questions from 15 templates; questions concern various relationships between plot elements and examine characteristics like the maximum, the minimum, area-under-the-curve, smoothness, and intersection. To resolve, such questions often require reference to multiple plot elements and synthesis of information distributed spatially throughout a figure. To facilitate the training of machine learning systems, the corpus also includes side data that can be used to formulate auxiliary objectives. In particular, we provide the numerical data used to generate each figure as well as bounding-box annotations for all plot elements. We study the proposed visual reasoning task by training several models, including the recently proposed Relation Network as a strong baseline. Preliminary results indicate that the task poses a significant machine learning challenge. We envision FigureQA as a first step towards developing models that can intuitively recognize patterns from visual representations of data.

show abstract

Section: Related Worksupporting

confidence: 87%

FigureQA: An Annotated Figure Dataset for Visual Reasoning

Kahou,

Michalski,

Atkinson

et al. 2017

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

“…Many video datasets are available to test models of action recognition or detection, including Hollywood2 [22], La-belMe video [40], UCF101 [31], HMDB51 [21], THUMOS [18], AVA [13], "something something" [12] and Charades [29]. Training deep neural networks for these tasks requires available large video datasets, like ActivityNet [6], Kinetics [19], Moments in Time [26], or YouTube-8M [1].…”

Section: Video Datasets and Modelsmentioning

confidence: 99%

“…Several large-scale video datasets provide a large diversity and coverage in terms of the categories of activities and exemplars they capture [19], [12], [26]. However, these labeled datasets only provide a single annotated label for each video and this label may not cover the rich spectrum of events occurring in the video.…”

Section: Introductionmentioning

confidence: 99%

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

Monfort¹,

Pan²,

Ramakrishnan³

et al. 2019

Preprint

View full text Add to dashboard Cite

An event happening in the world is often made of different activities and actions that can unfold simultaneously or sequentially within a few seconds. However, most large-scale datasets built to train models for action recognition provide a single label per video clip. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled and do not learn the full spectrum of information that would be mandatory to more completely comprehend different events and eventually learn causality between them. Towards this goal, we augmented the existing video dataset, Moments in Time (MiT), to include over two million action labels for over one million three second videos. This multi-label dataset introduces novel challenges on how to train and analyze models for multi-action detection. Here, we present baseline results for multi-action recognition using loss functions adapted for long tail multi-label learning and provide improved methods for visualizing and interpreting models trained for multi-label action detection.

show abstract

“…Additionally in this work the following datasets are used: NIST TRECVID Twitter vines [1], TGIF [18], MSVD [2], YouCook2 [43], Something-something V2 [10], Kinetics 700 [31], HowTo100M [23].…”

Section: Datasetsmentioning

confidence: 99%

MDMMT: Multidomain Multimodal Transformer for Video Retrieval

Dzabraev

Kalashnikov

Komkov

et al. 2021

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

103

View full text Add to dashboard Cite

We present a new state-of-the-art on the text to video retrieval task on MSRVTT and LSMDC benchmarks where our model outperforms all previous solutions by a large margin. Moreover, state-of-the-art results are achieved with a single model on two datasets without finetuning. This multidomain generalisation is achieved by a proper combination of different video caption datasets. We show that training on different datasets can improve test results of each other. Additionally we check intersection between many popular datasets and found that MSRVTT has a significant overlap between the test and the train parts, and the same situation is observed for ActivityNet.

show abstract

The "something something" video database for learning and evaluating visual common sense

Cited by 11 publications

References 0 publications

FigureQA: An Annotated Figure Dataset for Visual Reasoning

FigureQA: An Annotated Figure Dataset for Visual Reasoning

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

MDMMT: Multidomain Multimodal Transformer for Video Retrieval

Contact Info

Product

Resources

About