2016
DOI: 10.48550/arxiv.1608.01529
Preprint
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos

Cited by 18 publications (27 citation statements)
References 0 publications
“…After detecting the action regions in the frames, some methods [74], [75], [76], [77], [78], [79], [80], [81] use optical flow to capture motion cues. They employ linking algorithms to connect the frame-level bounding boxes into spatio-temporal action tubes.…”
Section: Frame-level Action Detection
confidence: 99%
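The linking step described in the statement above — connecting frame-level bounding boxes into spatio-temporal action tubes — is often realized as a greedy pass over frames that extends each tube with the best-overlapping detection. A minimal sketch (the helper names `iou` and `link_tubes` and the score-plus-overlap criterion are illustrative assumptions, not taken from any cited paper):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_tubes(detections, iou_thresh=0.3):
    """Greedily link per-frame detections into spatio-temporal tubes.

    `detections` is a list over frames; each frame is a list of
    (box, score) pairs. Each tube is extended by the unused detection
    maximizing (score + IoU with the tube's last box), provided the
    IoU exceeds `iou_thresh`; unlinked detections start new tubes.
    """
    tubes = [[d] for d in detections[0]]
    for frame in detections[1:]:
        unused = list(frame)
        for tube in tubes:
            last_box, _ = tube[-1]
            best, best_val = None, -1.0
            for d in unused:
                box, score = d
                ov = iou(last_box, box)
                if ov >= iou_thresh and score + ov > best_val:
                    best, best_val = d, score + ov
            if best is not None:
                tube.append(best)
                unused.remove(best)
        tubes.extend([d] for d in unused)
    return tubes
```

In practice the cited methods use more elaborate linking (e.g. dynamic programming over whole videos), but the per-frame greedy extension above captures the core idea of turning frame-level boxes into tubes.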
“…Weinzaepfel et al [79] replaced the linking algorithm with a tracking-by-detection method. Two-stream Faster R-CNN was then introduced by [76], [78]. Saha et al [78] fuse the scores of both streams based on the overlap between the appearance and motion detections.…”
Section: Frame-level Action Detection
confidence: 99%
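The excerpt describes Saha et al.'s fusion only at a high level (combining appearance- and motion-stream scores based on their overlap). A minimal sketch of one plausible form — an additive fusion with the motion contribution weighted by the IoU between the two streams' boxes; the exact formula is an assumption, not taken from the paper:

```python
def fuse_scores(app_score, mot_score, overlap):
    """Fuse appearance- and motion-stream detection scores.

    The motion score contributes in proportion to the spatial
    overlap (IoU in [0, 1]) between the appearance and motion boxes,
    so disagreeing streams add little and agreeing streams reinforce
    each other. This is an illustrative assumption, not the exact
    rule from [78].
    """
    return app_score + overlap * mot_score
```

With `overlap = 0` the appearance score passes through unchanged; with perfect overlap the two stream scores simply add.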
“…Spatio-temporal activity detection localizes activities within spatio-temporal tubes and requires heavier annotation work to collect the training data. [7,22,35,39,43] temporally track bounding boxes corresponding to activities in each frame to realize spatiotemporal activity detection. We focus on temporal activity detection [4,14,18,26,28,38] which only predicts the start and end times of the activities within long untrimmed videos and classifies the overall activity without spatially localizing people and objects in the frame.…”
Section: Activity Detection
confidence: 99%
“…The proposed pipeline (Figure 2) is divided into three parts: (i) action tube detection, (ii) part-based feature extraction and learning via 3D deformable RoI pooling, and (iii) a sparsity strategy that allows components of an action to be activated or deactivated while retaining the overall semantics of the activity. Action tube detection is a necessary preprocessing step aimed at detecting the spatio-temporal location of all the atomic actions present [11,22,41,31,4,40,33]. The tube detector needs to ensure a fixed-size representation for each activity part (atomic action).…”
Section: Introduction
confidence: 99%
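The last quoted statement requires a fixed-size representation for each activity part even though detected tubes vary in length. One simple way to obtain that — uniform temporal resampling of a tube's per-frame boxes to a fixed number of parts — can be sketched as follows (the function name `sample_tube` and the uniform-sampling choice are illustrative assumptions, not the cited paper's method):

```python
def sample_tube(tube_boxes, num_parts=4):
    """Resample a variable-length tube to a fixed temporal size.

    `tube_boxes` is the tube's list of per-frame boxes (any length >= 1);
    the result always has exactly `num_parts` entries, obtained by
    uniformly spaced index sampling, so every activity part feeds a
    fixed-size feature extractor downstream.
    """
    n = len(tube_boxes)
    idx = [round(i * (n - 1) / (num_parts - 1)) for i in range(num_parts)]
    return [tube_boxes[i] for i in idx]
```

A tube of one frame is simply repeated, while longer tubes are subsampled evenly across their duration.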