Alex Andonian scite author profile

Temporal Relational Reasoning in Videos

Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species. In this paper, we introduce an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales. We evaluate TRN-equipped networks on activity recognition tasks using three recent video datasets (Something-Something, Jester, and Charades) that fundamentally depend on temporal relational reasoning. Our results demonstrate that the proposed TRN gives convolutional neural networks a remarkable capacity to discover temporal relations in videos. Through only sparsely sampled video frames, TRN-equipped networks can accurately predict human-object interactions in the Something-Something dataset and identify various human gestures on the Jester dataset with very competitive performance. TRN-equipped networks also outperform two-stream networks and 3D convolution networks in recognizing daily activities in the Charades dataset. Further analyses show that the models learn intuitive and interpretable visual common sense knowledge in videos.

Moments in Time Dataset: One Million Videos for Event Understanding

Monfort

Vondrick

Oliva

et al. 2020

IEEE Trans. Pattern Anal. Mach. Intell.

382

336

We present the Moments in Time Dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds. Modeling the spatial-audio-temporal dynamics even for actions occurring in 3-second videos poses many challenges: meaningful events do not include only people, but also objects, animals, and natural phenomena; visual and auditory events can be symmetrical in time ("opening" is "closing" in reverse), and either transient or sustained. We describe the annotation process of our dataset (each video is tagged with one action or activity label among 339 different classes), analyze its scale and diversity in comparison to other large-scale video datasets for action recognition, and report results of several baseline models addressing separately, and jointly, three modalities: spatial, temporal and auditory. The Moments in Time dataset, designed to have a large coverage and diversity of events in both visual and auditory modalities, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.

GANalyze: Toward visual definitions of cognitive image properties

Goetschalckx

Andonian²,

Oliva³

et al. 2020

Journal of Vision

109

184

Cross-View Semantic Segmentation for Sensing Surroundings

Pan

Sun

Leung

et al. 2020

IEEE Robot. Autom. Lett.

136

Paint by Word

Andonian¹,

Osmany²,

Cui³

et al. 2021

Preprint

Temporal Relational Reasoning in Videos

Zhou¹,

Andonian²,

Oliva³

et al. 2017

Preprint

Moments in Time Dataset: one million videos for event understanding

Monfort¹,

Andonian²,

Zhou³

et al. 2018

Preprint

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

Monfort¹,

Pan²,

Ramakrishnan

et al. 2022

IEEE Trans. Pattern Anal. Mach. Intell.
