2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01323
TubeR: Tubelet Transformer for Video Action Detection

Cited by 33 publications (32 citation statements); references 28 publications.
“…Recent works, such as [63,72], highlight the effectiveness of Transformer-based approaches for the task of detecting spatio-temporal tubes in videos. In particular, TubeR [72] proposed an end-to-end approach using no proposals or person detectors.…”
Section: Related Work
confidence: 99%
“…Zhao et al. [18] propose an end‐to‐end action detection framework, which can be optimised for modelling action tubes with variable lengths and aspect ratios.…”
Section: Related Work
confidence: 99%
“…Action Detection is a more challenging problem [20,90,69] than action recognition [67,6] due to the additional requirement of localising actions in a large spatio-temporal search space. Supervised action detection methods [81,69,34,44,90,56] have made large strides thanks to large-scale datasets like UCF24 [73], AVA [26] and MultiSports [41]. Most current approaches follow the key-frame based approach popularised by SlowFast [20].…”
Section: Related Work
confidence: 99%
“…There have been more sophisticated approaches, e.g. based on actor-context modelling [10,56], on long-term feature banks [82,74], and on transformer heads [90,45]. We will use the key-frame based SlowFast [20] network as our default action detector because of its simplicity, competitive performance, and the reproducible code base provided in pySlowFast [19], which can easily be extended to include transformer architectures, such as MViTv2 [45].…”
Section: Related Work
confidence: 99%