Hybrid Relation Guided Set Matching for Few-shot Action Recognition

Wang, Xiang; Zhang, Shiwei; Qing, Zhiwu; Tang, Mingqian; Zuo, Zhang; Gao, Changxin; Jin, Rong; Sang, Nong

doi:10.1109/cvpr52688.2022.01932

Cited by 68 publications

(74 citation statements)

References 36 publications


“…As indicated in Table 1 and Table 2, the proposed HyRSM++ surpasses other advanced approaches significantly and is able to achieve new state-of-the-art performance. For instance, HyRSM++ improves the state-of-the-art performance from 49.2% to 55.0% under the 1-shot setting on SSv2-Full and consistently outperforms our original conference version [91]. Specially, extensively compared with current strict temporal alignment techniques [7,106] and complex fusion methods [48,68], HyRSM++ produces results that are superior to them under most different shots, which implies that our approach is considerably flexible and efficient.…”

Section: Comparison With State-of-the-art (mentioning)

confidence: 75%


HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition

Wang, Zhang, Qing et al. 2023

Preprint


Few-shot action recognition is a challenging but practical problem aiming to learn a model that can be easily adapted to identify new action categories with only a few labeled samples. Recent attempts mainly focus on learning deep representations for each video individually under the episodic meta-learning regime and then performing temporal alignment to match query and support videos. However, they still suffer from two drawbacks: (i) learning individual features without considering the entire task may result in limited representation capability, and (ii) existing alignment strategies are sensitive to noises and misaligned instances. To handle the two limitations, we propose a novel Hybrid Relation guided temporal Set Matching (HyRSM++) approach for few-shot action recognition. The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations and involve a robust matching technique. To be specific, HyRSM++ consists of two key components, a hybrid relation module and a temporal set matching metric. Given the basic representations from the feature extractor, the hybrid relation module is introduced to fully exploit associated relations within and cross videos in an episodic task and thus can learn task-specific embeddings. Subsequently, in the temporal set matching metric, we carry out the distance measure between query and support


“…In this paper, we have extended our preliminary CVPR-2022 conference version [91] in the following aspects. i) We integrate the temporal coherence regularization and set matching strategy into a temporal set matching metric so that the proposed metric can explicitly leverage temporal order information in videos and match flexibly.…”

Section: Introduction (mentioning)

confidence: 99%


“…We compare SSA²lign with state-of-the-art FSDA approaches, and prevailing UDA/VUDA and few-shot action recognition (FSAR) approaches. These methods include: FADA [26], d-SNE [54] designed for image-based FSDA; DANN [10], MK-MMD [24], MDD [67], SAVA [8] and ACAN [56], designed for UDA/VUDA; and TRX [30], STRM [41], and HyRSM [50] proposed for FSAR. To adapt the FSAR approaches for FSVDA, the source domain is used for meta-training and the target domain is used for the meta-testing, while target labels are available for optimizing the cross-entropy loss to adapt UDA/VUDA approaches for FSVDA.…”

Section: Overall Results and Comparisons (mentioning)

confidence: 99%

Augmenting and Aligning Snippets for Few-Shot Video Domain Adaptation

Xu, Yang, Zhou et al. 2023

Preprint


For video models to be transferred and applied seamlessly across video tasks in varied environments, Video Unsupervised Domain Adaptation (VUDA) has been introduced to improve the robustness and transferability of video models. However, current VUDA methods rely on a vast amount of high-quality unlabeled target data, which may not be available in real-world cases. We thus consider a more realistic Few-Shot Video-based Domain Adaptation (FSVDA) scenario where we adapt video models with only a few target video samples. While a few methods have touched upon Few-Shot Domain Adaptation (FSDA) in images and in FSVDA, they rely primarily on spatial augmentation for target domain expansion with alignment performed statistically at the instance level. However, videos contain more knowledge in terms of rich temporal and semantic information, which should be fully considered while augmenting target domains and performing alignment in FSVDA. We propose a novel SSA²lign to address FSVDA at the snippet level, where the target domain is expanded through a simple snippet-level augmentation followed by the attentive alignment of snippets both semantically and statistically, where semantic alignment of snippets is conducted through multiple perspectives. Empirical results demonstrate state-of-the-art performance of SSA²lign across multiple cross-domain action recognition benchmarks.


“…Some methods [84,72,82,83] adopt the idea of global matching in the field of few-shot image classification [50,52] to carry out few-shot matching, which results in relatively poor performance because long-term temporal alignment information is ignored in the measurement process. To exploit the temporal cues, the following approaches [3,76,42,29,64,53,61,38,19,62,77] focus on local frame-level (or segment-level) alignment between query and support videos. Among them, OTAM [3] proposes a variant of the dynamic time warping technique [37] to explicitly utilize the temporal ordering information in support-query video pairs.…”

Section: Related Work (mentioning)

confidence: 99%
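The contrast drawn in the statement above, between global matching and frame-level temporal alignment, can be illustrated with a minimal sketch. It uses textbook dynamic time warping over per-frame feature vectors, not OTAM's variant, and the function names, the cosine frame distance, and the toy two-frame "videos" are all assumptions made for illustration only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def global_distance(query, support):
    """Global matching: average-pool the frames of each video, then compare.

    Frame order is discarded by the pooling step.
    """
    def mean_pool(frames):
        n = len(frames)
        return [sum(f[k] for f in frames) / n for k in range(len(frames[0]))]

    return cosine_distance(mean_pool(query), mean_pool(support))


def dtw_distance(query, support):
    """Classic dynamic time warping over frame-to-frame cosine distances.

    Assumption: this is plain DTW, shown only to illustrate order-aware
    alignment; it is not the variant proposed in OTAM.
    """
    n, m = len(query), len(support)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i query frames
    # with the first j support frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], support[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy example (hypothetical features): the same two frames in opposite order.
forward = [[1.0, 0.0], [0.0, 1.0]]
backward = [[0.0, 1.0], [1.0, 0.0]]

print("global:", global_distance(forward, backward))
print("dtw:   ", dtw_distance(forward, backward))
```

On this toy pair the pooled representations coincide, so global matching cannot tell the two orderings apart, while the DTW cost is strictly positive because no monotonic alignment pairs up the matching frames.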

“…Despite this, modern models require massive data annotation, which may be time-consuming and laborious to collect. Few-shot action recognition is a promising direction to alleviate the data labeling problem, which aims to identify unseen classes with a few labeled videos and has received considerable attention [82,3,61].…”

Section: Introduction (mentioning)

confidence: 99%

CLIP-guided Prototype Modulating for Few-shot Action Recognition

Wang, Zhang, Cen et al. 2023

Preprint


Learning from large-scale contrastive language-image pre-training like CLIP has shown remarkable success in a wide range of downstream tasks recently, but it is still under-explored on the challenging few-shot action recognition (FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue due to data scarcity, which is a critical problem in low-shot regimes. To this end, we present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of two key components: a video-text contrastive objective and a prototype modulation. Specifically, the former bridges the task discrepancy between CLIP and the few-shot video task by contrasting videos and corresponding class text descriptions. The latter leverages the transferable textual concepts from CLIP to adaptively refine visual prototypes with a temporal Transformer. By this means, CLIP-FSAR can take full advantage of the rich semantic priors in CLIP to obtain reliable prototypes and achieve accurate few-shot classification. Extensive experiments on five commonly used benchmarks demonstrate the effectiveness of our proposed method, and CLIP-FSAR significantly outperforms existing state-of-the-art methods under various settings. The source code and
