2018
DOI: 10.1109/tpami.2017.2769085

Action Recognition with Dynamic Image Networks

Abstract: We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis, particularly in combination with convolutional neural networks (CNNs). A dynamic image encodes temporal data such as RGB or optical flow videos by using the concept of 'rank pooling'. The idea is to learn a ranking machine that captures the temporal evolution of the data and to use the parameters of the latter as a representation. When a linear ranking machine is used, the resulting represent…
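
The abstract is cut off before it says how the ranking-machine parameters are obtained, so the following is only a rough sketch of the general idea rather than the authors' implementation. It assumes a simple linear weighting of frames, alpha_t = 2t - T - 1, as a stand-in for learning the ranking machine; the function names, the weighting, and the toy clip are all illustrative assumptions that do not come from the text above.

```python
import numpy as np


def rank_pooling_weights(num_frames: int) -> np.ndarray:
    """Assumed linear weights alpha_t = 2t - T - 1 for t = 1..T (they sum to zero)."""
    t = np.arange(1, num_frames + 1, dtype=np.float64)
    return 2.0 * t - num_frames - 1.0


def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W, C) clip into a single (H, W, C) image.

    Later frames receive positive weight and earlier frames negative weight,
    so the result emphasises how pixel values change over time.
    """
    alpha = rank_pooling_weights(frames.shape[0])
    # Weighted sum over the time axis (axis 0 of both operands).
    return np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))


# Hypothetical toy clip: 5 frames of 4x4 RGB whose brightness grows by 1 per frame.
clip = np.stack([np.full((4, 4, 3), float(t)) for t in range(5)])
di = dynamic_image(clip)

# A clip with no motion: every frame identical.
static_clip = np.ones((5, 4, 4, 3))
static_di = dynamic_image(static_clip)
```

Because the assumed weights sum to zero, a clip in which nothing changes maps to an all-zero image, while the brightening toy clip maps to a constant positive image. In practice the output would typically be rescaled to a valid image range before being passed to a CNN.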

Citations: cited by 182 publications (169 citation statements)
References: 70 publications (108 reference statements)
“…The performance gain on HMDB51, 3.0%, is larger than on UCF101, 1.6%. The combined performances, 75.1% on HMDB51 and 96.0% on UCF101, outperform all state-of-the-arts and even on par with [2,34] which employ more prediction scores from additional modalities. Note that we observe similar performance boost with iDT [42] but choose MIFS since the prediction scores are available in public.

Methods                  HMDB51          UCF101
iDT+FV [43]              57.2            85.9
iDT+HSV [28]             61.1            87.9
Two-stream [31]          59.4            88.0
Transformation [48]      62.0            92.4
KVM [13]                 63.3            93.1
Two-Stream Fusion [9]    65.4 (69.2 i)   92.5 (93.5 i)
ST-ResNet [7]            66.4 (70.3 i)   93.4 (94.6 i)
ST-Multiplier [8]        68.9 (72.2 i)   94.2 (94.9 i)
ActionVLAD [11]          66.9 (69.8 i)   92.7 (93.6 i)
ST-Vector [3]            69.
…”
Section: Comparison With the State-of-the-art
Citation type: mentioning
confidence: 88%
“…
Method                             UCF101   HMDB51
iDT+Fisher vector [19]             84.8     57.2
iDT+HSV [20]                       87.9     61.1
C3D+iDT+SVM [8]                    90.4     -
Two-Stream (fusion by SVM) [7]     88.0     59.4
Two-Stream Fusion+iDT [21]         93.5     69.2
TSN (BN-Inception) [18]            94.2     69.4
Two-Stream I3D [9]                 93.4     66.4
TDD+iDT [22]                       91.5     65.9
Dynamic Image Network [6]          95.5     72.5
Temporal Squeeze Network           95.2     71.5

Table 4. Temporal squeeze network compared with other methods on UCF101 and HMDB51, in terms of top-1 accuracy, averaged over three splits.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…This result further demonstrates that summarizing the dynamics of a long video clip into a single image would lose essential spatio-temporal information. A dynamic image [6] attempts to summarize the entire information from a video clip into a single image, which can explain why they fail to properly represent long video clips.…”
Section: Visualization Analysis
Citation type: mentioning
confidence: 99%
“…1. Generally speaking, four hypotheses that motivate us to build a skeleton-based representation and design DenseNets for 3D HAR include: (1) human actions can be correctly represented via movements of the skeleton [16]; (2) spatio-temporal evolutions of skeletons can be transformed into color images - a kind of 3D tensor that can be effectively learned by D-CNNs [1,5,3]. This hypothesis was proved in our previous studies [27,28,29]; (3) compared to RGB and depth modalities, skeletal data has high-level information with much less complexity.…”
Section: Introduction
Citation type: mentioning
confidence: 99%