Deep Image-to-Video Adaptation and Fusion Networks for Action Recognition
2020
DOI: 10.1109/tip.2019.2957930

Cited by 51 publications (22 citation statements)
References 58 publications
“…For video representation learning, a large number of supervised learning methods have been proposed and received increasing attention; these rely on robust modeling and feature representation in videos. The methods include traditional methods [22,19,47,48,34,38,30,27] and deep learning methods [40,43,51,28,44,52,61,25,29,26]. To model and discover temporal knowledge in videos, two-stream CNNs [40] classified video frames (spatial) and dense optical flow (temporal) separately, then directly fused the class scores of these two networks to obtain the classification result.…”
Section: Supervised Video Representation Learning (mentioning)
confidence: 99%
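The late score fusion described in the excerpt above is simple to express in code. Below is a minimal PyTorch sketch, assuming two pretrained stream networks that each return per-class logits; the tiny linear stand-ins, tensor shapes, and the function name `two_stream_predict` are all illustrative, not the architecture of [40] itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def two_stream_predict(spatial_net, temporal_net, rgb_frame, flow_stack):
    """Late score fusion as in two-stream CNNs [40]: each stream is
    classified independently and the class scores are averaged."""
    spatial_scores = F.softmax(spatial_net(rgb_frame), dim=1)     # (N, K)
    temporal_scores = F.softmax(temporal_net(flow_stack), dim=1)  # (N, K)
    fused = (spatial_scores + temporal_scores) / 2                # average fusion
    return fused.argmax(dim=1)                                    # predicted class ids

# Illustrative stand-ins for the two streams (real models would be
# deep CNNs over RGB frames and stacked optical flow fields).
num_classes = 101
spatial_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, num_classes))
temporal_net = nn.Sequential(nn.Flatten(), nn.Linear(20 * 224 * 224, num_classes))

rgb = torch.randn(2, 3, 224, 224)    # one RGB frame per clip (spatial stream)
flow = torch.randn(2, 20, 224, 224)  # 10 stacked flow fields, x/y channels
print(two_stream_predict(spatial_net, temporal_net, rgb, flow))
```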
“…In [32], the authors employed transfer learning from the image domain to enhance video action recognition, while in [33], the authors proposed a Generative Adversarial Network that learns a common feature space of images and videos to improve recognition accuracy. Finally, in [34], the authors integrated images and videos into a common representation using cross-modal similarity metrics to enhance action recognition accuracy. In this work, a cross-modal method for CSLR is proposed, which takes advantage of the ability of CTC to handle weakly labeled data, while simultaneously leveraging text information to model intra-gloss dependencies through the cross-modal alignment of video and text embeddings.…”
Section: Related Work (mentioning)
confidence: 99%
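The CTC objective mentioned in this excerpt is what lets such a CSLR model train on weakly labeled data: the ordered gloss sequence per video is known, but frame-level alignments are not. A minimal sketch using PyTorch's built-in `nn.CTCLoss`; all tensor shapes (frame count, vocabulary size, gloss counts) are hypothetical placeholders, not values from the cited work.

```python
import torch
import torch.nn as nn

T, N, C = 120, 4, 1000  # frames, batch size, gloss vocabulary (index 0 = blank)
# Per-frame gloss log-probabilities, as a video encoder would emit.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

targets = torch.randint(1, C, (N, 20), dtype=torch.long)  # gloss labels per video
input_lengths = torch.full((N,), T, dtype=torch.long)     # frames per video
target_lengths = torch.randint(5, 21, (N,), dtype=torch.long)  # glosses per video

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow without any frame-level alignment labels
```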
“…With the development of deep learning methods and various hardware such as cameras and wearable devices, several typical approaches to multi-modal action recognition have emerged in recent years. These methods can be roughly categorized into three types: 1) cross-view action recognition, where typical works [29], [30] used transfer learning methods to reduce the domain gap between action data from different camera views; 2) cross-spectral action recognition, where typical works [31], [32] addressed visible-to-infrared action recognition using domain adaptation methods; 3) cross-media action recognition, where typical works [33], [34] designed specific multi-modal feature learning frameworks to address image-to-video action recognition problems.…”
Section: B. Multi-modal Action Recognition (mentioning)
confidence: 99%
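The cross-view and cross-spectral categories in this excerpt both lean on domain adaptation to close the gap between modalities. One widely used generic building block for this (not attributed to any of the cited papers) is the gradient reversal layer of adversarial domain adaptation; a minimal PyTorch sketch with illustrative shapes:

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated (and
    scaled) gradients in the backward pass, so the feature extractor is
    pushed toward domain-invariant features while a domain classifier
    tries to distinguish the two domains (e.g. two camera views)."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Hypothetical usage: features extracted from clips of two domains.
features = torch.randn(8, 256, requires_grad=True)   # (batch, feat_dim)
domain_head = nn.Linear(256, 2)                      # source vs. target domain
domain_logits = domain_head(grad_reverse(features))  # reversed grads hit the backbone
domain_logits.sum().backward()                       # features.grad is now reversed
```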