In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens over a longer temporal horizon for videos, or over the spatial content for images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, because our tokens are adaptive, we achieve competitive results at a significantly reduced compute cost.

Recent advancements in image understanding demonstrate improved accuracy on vision classification tasks. For example, departing from standard convolutional approaches, the Vision Transformer (ViT) [9] treats the image as a sequence of patches, utilizing the Transformer architecture [38] similarly to text understanding.

Standard approaches for video recognition take videos as stacked images (i.e., a space-time volume) and tend to extend 2D neural architectures to 3D (e.g., [5, 37, 11]). In parallel to the Vision Transformer for images, some approaches [2, 3] proposed to create 3D 'cubelet' video tokens on a regular 3D grid, which are further processed by a Transformer, resulting in computationally heavy models. There are too many tokens to process, especially for longer videos.

The main question addressed in this work is how to adaptively learn the representation from visual inputs to most effectively capture the spatial information for images and the spatio-temporal interactions for videos. Here are our main ideas: