Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413502

Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition

Abstract (excerpt): Figure 1: Concept: depth information helps to model the relation between the moving person and the scene, thereby guiding the model to learn richer context and a less biased representation. We observe two important characteristics: 1) scene information clearly helps us to recognize actions; in this example, the mountain is critical to identifying the action as "riding mountain bike". 2) Even if the scene shifts from the mountain to the roadside, we can still recognize the action correctly. The example is sampled from the Kinetic…



Cited by 62 publications (34 citation statements)
References 51 publications (70 reference statements)
“…Cao et al [6] focus on long-term temporal ordering information and propose a temporal-alignment-based method for few-shot action recognition. Fu et al [16] introduce depth as extra visual information and propose a temporal asynchronization augmentation mechanism to augment the source video representation. Besides, they propose a depth-guided adaptive instance normalization module to fuse original RGB clips with non-strictly corresponding depth clips at the feature level.…”
Section: Few-shot Action Recognition
confidence: 99%
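The depth-guided fusion described above builds on adaptive instance normalization (AdaIN): normalize the RGB features per channel, then re-scale them with the depth features' channel-wise statistics. A minimal sketch of that core operation, assuming (C, H, W) feature maps; the function name and the plain (non-learned) formulation here are illustrative, not the authors' actual module:

```python
import numpy as np

def adain_fuse(rgb_feat, depth_feat, eps=1e-5):
    """Fuse an RGB feature map with a (possibly non-corresponding) depth
    feature map: whiten the RGB features per channel, then re-style them
    with the depth features' channel-wise mean and std.
    Both inputs have shape (C, H, W). Illustrative sketch only."""
    mu_rgb = rgb_feat.mean(axis=(1, 2), keepdims=True)
    std_rgb = rgb_feat.std(axis=(1, 2), keepdims=True) + eps
    mu_depth = depth_feat.mean(axis=(1, 2), keepdims=True)
    std_depth = depth_feat.std(axis=(1, 2), keepdims=True) + eps
    # normalized RGB content, re-scaled by depth statistics
    return std_depth * (rgb_feat - mu_rgb) / std_rgb + mu_depth
```

Because the depth clip only contributes statistics, it does not need to be strictly frame-aligned with the RGB clip, which is what makes fusing "non-strictly corresponding" clips feasible.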
“…Besides, the query set is sampled from the remaining videos of the M classes. Following [16], in both the training and testing phases, the query set contains one video per episode.…”
Section: Model Formulation 3.1 Architecture Overview
confidence: 99%
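The episodic setup quoted above can be sketched as follows. This is a generic M-way, K-shot sampler with the one-query-video-per-episode convention the statement describes; the function and variable names are ours, not from the cited papers:

```python
import random

def sample_episode(videos_by_class, m_way, k_shot, seed=None):
    """Sample one few-shot episode: pick M classes, place K videos of
    each into the support set, and draw the single query video from the
    remaining videos of those M classes. Illustrative sketch only."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(videos_by_class), m_way)
    support, leftovers = {}, []
    for c in classes:
        # shuffle this class's videos, split into support / leftovers
        vids = rng.sample(videos_by_class[c], len(videos_by_class[c]))
        support[c] = vids[:k_shot]
        leftovers.extend((c, v) for v in vids[k_shot:])
    query = rng.choice(leftovers)  # one query video per episode
    return support, query
```

During meta-training, many such episodes are drawn so the model repeatedly faces new M-way classification tasks, matching the episodic paradigm described in the quotes.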
“…Different modules for fusing instance appearance information and action structure information can be applied at this node, such as bilinear pooling [24,41], attention mechanisms [21,39,43], and other approaches [11,34]. For simplicity, we use a concatenation operation followed by fully connected layers as the fusion module.…”
Section: Appearance Bias in Compositional Action
confidence: 99%
“…With extensive efforts devoted to few-shot learning (FSL) recently, great success has been achieved in few-shot image classification [12,34,37,40,53]. Spurred by that, attempts are being made to extend FSL to the action recognition domain [3,9,23,27,48,51,52]. Most methods follow the established meta-learning paradigm [40], where models are trained episodically with randomly sampled support and query sets.…”
Section: Introduction
confidence: 99%