AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

Gu, Chunhui; Sun, Chen; Ross, David A.; Vondrick, Carl; Pantofaru, Caroline; Li, Yeqing; Vijayanarasimhan, Sudheendra; Toderici, George; Ricco, Susanna; Sukthankar, Rahul; Schmid, Cordelia; Malik, Jitendra

doi:10.48550/arxiv.1705.08421

Cited by 24 publications

(56 citation statements)

References 0 publications


“…Most of the existing action recognition datasets contain high resolution, actor centric videos [25], [12], [13], [19], [2], [18], [8], [11], [17], [6]. For example, Kinetics [13], Charades [19], Youtube-8M [2] are collected from Youtube videos where actions cover most of the image regions in every frame of a video.…”

Section: TinyVIRAT Dataset (mentioning)

confidence: 99%

“…The availability of large-scale datasets and the progress of neural networks have provided significant improvement to video action recognition task. Datasets with multiple actors and actions such as UCF-101 [21], Kinetics [20,13], AVA [8], YouTube-8M [1] and Moments-in-time [15] provide a large set of data with higher versatility for training neural networks. This has enabled several state-of-the-art architectures such as C3D [22], I3D [3], ResNet-3D [9] and R2+1D [23] which have been effective at recognizing the correct actions.…”

Section: Introduction (mentioning)

confidence: 99%


TinyAction Challenge: Recognizing Real-world Low-resolution Activities in Videos

Tirupattur, Rana, Sangam et al. 2021

Preprint


This paper summarizes the TinyAction challenge, which was organized in the ActivityNet workshop at CVPR 2021. This challenge focuses on recognizing real-world low-resolution activities present in videos. The action recognition task is currently focused on classifying actions from high-quality videos where the actors and the action are clearly visible. While various approaches have been shown effective for the recognition task in recent works, they often do not deal with videos of lower resolution where the action is happening in a tiny region. However, many real-world security videos have the actual action captured in a small resolution, making action recognition in a tiny region a challenging task. In this work, we propose a benchmark dataset, TinyVIRAT-v2, which is comprised of naturally occurring low-resolution actions. This is an extension of the TinyVIRAT dataset [7] and consists of actions with multiple labels. The videos are extracted from security videos, which makes them realistic and more challenging. We use current state-of-the-art action recognition methods on the dataset as a benchmark, and propose the TinyAction Challenge.


“…Later, 3D ConvNets [3,30,36] are shown to perform better in spatiotemporal modeling. With many large-scale video datasets [11,5,2,10] coming out, 3D ConvNets are able to get high accuracies when incorporated into a two-stream framework. However, 3D ConvNets are computationally heavy and there are some efforts like [32,31], trying to alleviate 3D computations and get comparable or even better performances.…”

Section: Related Work (mentioning)

confidence: 99%

Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition

Huang, Lin, Karaman et al. 2019

Preprint


Two-stream networks have achieved great success in video recognition. A two-stream network combines a spatial stream of RGB frames and a temporal stream of Optical Flow to make predictions. However, the temporal redundancy of RGB frames as well as the high-cost of optical flow computation creates challenges for both the performance and efficiency. Recent works instead use modern compressed video modalities as an alternative to the RGB spatial stream and improve the inference speed by orders of magnitudes. Previous works create one stream for each modality which are combined with an additional temporal stream through late fusion. This is redundant since some modalities like motion vectors already contain temporal information. Based on this observation, we propose a compressed domain two-stream network (IP TSN) for compressed video recognition, where the two streams are represented by the two types of frames (I and P frames) in compressed videos, without needing a separate temporal stream. With this goal, we propose to fully exploit the motion information of P-stream through generalized distillation from optical flow, which largely improves the efficiency and accuracy. Our P-stream runs 60 times faster than using optical flow while achieving higher accuracy. Our full IP TSN, evaluated over public action recognition benchmarks (UCF101, HMDB51 and a subset of Kinetics), outperforms other compressed domain methods by large margins while improving the total inference speed by 20%.


“…Action recognition has become a more and more important topic in the field of academic research as well as in industrial context. This is shown by the amount of publications and the diversity of research directions, as well as by the growing number of challenging datasets in this field [10,5,33,11]. So far, most of these approaches rely on fully supervised training.…”

Section: Introduction (mentioning)

confidence: 99%

Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data

Kuehne, Iqbal, Richard et al. 2019

Preprint


Action recognition has so far mainly focused on the classification of hand-selected, pre-clipped actions, and has reached impressive results in this field. But with performance nearing a ceiling on current datasets, it also appears that the next steps in the field will have to go beyond this fully supervised classification. One way to overcome those problems is to move towards less restricted scenarios. In this context we present a large-scale real-world dataset designed to evaluate learning techniques for human action recognition beyond hand-crafted datasets. To this end we put the process of collecting data on its feet again and start with the annotation of a test set of 250 cooking videos. The training data is then gathered by searching for the respective annotated classes within the subtitles of freely available videos. The uniqueness of the dataset is attributed to the fact that the whole process of collecting the data and training does not involve any human intervention. To address the problem of semantic inconsistencies that arise with this kind of training data, we further propose a semantic hierarchical structure for the mined classes. We benchmark the proposed dataset with respect to current features and architectures on the task of temporal alignment and show challenges in this field as well as the benefits of semantic models in this context.
