Deep Image-to-Video Adaptation and Fusion Networks for Action Recognition

Liu, Yang; Lu, Zhaoyang; Li, Jing; Yang, Tao; Yao, Chao

doi:10.1109/tip.2019.2957930

Cited by 51 publications

(22 citation statements)

References 58 publications


“…For video representation learning, a large number of supervised learning methods have been proposed and received increasing attention, which relies on robust modeling and feature representation in videos. The methods include traditional methods [22,19,47,48,34,38,30,27] and deep learning methods [40,43,51,28,44,52,61,25,29,26]. To model and discover temporal knowledge in videos, two-stream CNNs [40] judged the video image (spatial) and dense optical flow (temporal) separately, then directly fused the class scores of these two networks to obtain the classification result.…”

Section: Supervised Video Representation Learning (mentioning)

confidence: 99%
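The score fusion described in the statement above can be illustrated with a minimal sketch. It assumes each stream already produces one raw score per class; the function names, the softmax normalization, and the `w_spatial` weight are illustrative assumptions, not details taken from the cited two-stream work.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fuse(spatial_logits, temporal_logits, w_spatial=0.5):
    """Weighted average of per-class probabilities from the two streams.

    w_spatial is a hypothetical mixing weight; 0.5 means a plain average.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    p_s = softmax(spatial_logits)
    p_t = softmax(temporal_logits)
    return [w_spatial * a + (1.0 - w_spatial) * b for a, b in zip(p_s, p_t)]


def predict(spatial_logits, temporal_logits, w_spatial=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = late_fuse(spatial_logits, temporal_logits, w_spatial)
    return max(range(len(fused)), key=fused.__getitem__)
```

With equal weights, a confident temporal stream can override a less confident spatial stream, which is the point of fusing at the score level rather than picking one stream.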

TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning

Liu, Wang, Liu, et al. 2022. IEEE Trans. on Image Process. (self-citation)

To fully explore the fine-grained temporal structure and global-local chronological characteristics for self-supervised video representation learning, this work takes a closer look at exploiting the temporal structure of videos and proposes a novel self-supervised method named Temporal Contrastive Graph (TCG). In contrast to existing methods that randomly shuffle the video frames or video snippets within a video, the proposed TCG is rooted in a hybrid graph contrastive learning strategy that treats the inter-snippet and intra-snippet temporal relationships as self-supervision signals for temporal representation learning. To increase the temporal diversity of features more comprehensively and precisely, TCG integrates prior knowledge about the frame and snippet orders into temporal contrastive graph structures, i.e., the intra-/inter-snippet temporal contrastive graph modules. By randomly removing edges and masking node features of the intra-snippet or inter-snippet graphs, TCG can generate different correlated graph views. Specific contrastive losses are then designed to maximize the agreement between node embeddings in different views. To learn the global context representation and recalibrate the channel-wise features adaptively, the authors introduce an adaptive video snippet order prediction module, which leverages the relational knowledge among video snippets to predict the actual snippet orders. Experimental results demonstrate the superiority of TCG over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
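The view-generation step mentioned in the abstract (randomly removing edges and masking node features) can be sketched roughly as follows. This is an illustration under stated assumptions, not the TCG implementation: the graph is a plain edge list, node features are lists of floats, masking zeroes whole feature dimensions shared across all nodes, and every name and parameter here is hypothetical.

```python
import random


def make_view(edges, features, drop_edge_p, mask_feat_p, seed=None):
    """Build one augmented view of a graph.

    edges: list of (src, dst) tuples.
    features: list of per-node feature lists, all of the same length.
    drop_edge_p: probability of removing each edge independently.
    mask_feat_p: probability of zeroing each feature dimension.
    """
    rng = random.Random(seed)

    # Keep each edge with probability 1 - drop_edge_p.
    kept_edges = [e for e in edges if rng.random() >= drop_edge_p]

    # Assumption: one mask shared by all nodes; masked dimensions become 0.0.
    dim = len(features[0]) if features else 0
    keep_dim = [rng.random() >= mask_feat_p for _ in range(dim)]
    masked = [
        [x if keep else 0.0 for x, keep in zip(row, keep_dim)]
        for row in features
    ]
    return kept_edges, masked


# Two differently seeded calls give two correlated views of the same graph,
# which a contrastive loss could then be asked to bring into agreement.
edges = [(0, 1), (1, 2), (2, 3)]
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
view_a = make_view(edges, feats, drop_edge_p=0.3, mask_feat_p=0.3, seed=1)
view_b = make_view(edges, feats, drop_edge_p=0.3, mask_feat_p=0.3, seed=2)
```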


“…In [32], the authors employed transfer learning from the image domain to enhance video action recognition, while in [33], the authors proposed a Generative Adversarial Network that learns a common feature space of images and videos to improve recognition accuracy. Finally, in [34], the authors integrated images and videos into a common representation using cross-modal similarity metrics to enhance the action recognition accuracy. In this work, a cross-modal method for CSLR is proposed, which takes advantage of the ability of CTC to handle weakly labeled data, while simultaneously leverages text information to model intra-gloss dependencies through the cross-modal alignment of video and text embeddings.…”

Section: Related Work (mentioning)

confidence: 99%

Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space

et al. 2020


Continuous Sign Language Recognition (CSLR) refers to the challenging problem of recognizing sign language glosses and their temporal boundaries from weakly annotated video sequences. Previous methods focus mostly on visual feature extraction neglecting text information and failing to effectively model the intra-gloss dependencies. In this work, a cross-modal learning approach that leverages text information to improve vision-based CSLR is proposed. To this end, two powerful encoding networks are initially used to produce video and text embeddings prior to their mapping and alignment into a joint latent representation. The purpose of the proposed cross-modal alignment is the modelling of intra-gloss dependencies and the creation of more descriptive video-based latent representations for CSLR. The proposed method is trained jointly with video and text latent representations. Finally, the aligned video latent representations are classified using a jointly trained decoder. Extensive experiments on three well-known sign language recognition datasets and comparison with state-of-the-art approaches demonstrate the great potential of the proposed approach.


“…With the development of deep learning methods and various hardware such as cameras and wearable devices, there are some typical methods of dealing with multi-modal action recognition problems in recent years. These methods can be roughly categorized into three types: 1) cross-view action recognition, typical works [29], [30] used transfer learning methods to reduce the domain gap of action data from different camera views; 2) cross-spectral action recognition, typical works [31], [32] addressed the visible-to-infrared action recognition problems using domain adaptation methods; 3) cross-media action recognition, typical works [33], [34] designed specific multi-modal feature learning frameworks to address the image-to-video action recognition problems.…”

Section: B Multi-modal Action Recognition (mentioning)

confidence: 99%

Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition

Liu, Wang, et al. 2020. Preprint. (self-citation)
