Spatio-temporal Contrastive Domain Adaptation for Action Recognition

Song, Xiaolin; Zhao, Sicheng; Yang, Jingyu; Yue, Huanjing; Xu, Pengfei; Hu, Runbo; Chai, Hua

doi:10.1109/cvpr46437.2021.00966

Cited by 56 publications

(50 citation statements)

References 23 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Here, we focus on video domain adaptation for activity recognition. State-of-the-art visual-only solutions learn to reduce the shift in activity appearance by adversarial training [5,6,8,9,20,27,29] and self-supervised learning techniques [9,22,27,34]. While Jamal et al [20] and Munro and Damen [27] directly penalize domain specific features with an adversarial loss at every time stamp, Chen et al [5], Choi et al [9] and Pan et al [29] attend to temporal segments that contain important cues.…”

Section: Related Workmentioning

confidence: 99%

“…Self-supervised learning objectives are also incorporated in [27] and [9] to better align the features across domains by utilizing the correspondences between RGB and optical flow or the temporal order of video clips. Song et al [34] and Kim et al [22] obtain remarkable performance by contrastive learning for self-supervised learning to align the feature distributions between video domains. Instead of relying on the vision modality only, which may present large activity appearance variance, we consider the domain-invariant information within sound to help the model adapt to the visual distribution shift.…”

Section: Related Workmentioning

confidence: 99%

“…The networks are initialized with Kinetics pre-trained model weights. Such pre-training on a large dataset is common in domain adaptation, e.g., for images (ImageNet) [7,17,40,41] and videos (Sports-1M [20], Kinetics [6,9,22,27,34]). Audio encoder.…”

Section: A Implementation Detailsmentioning

confidence: 99%

“…When using a single modality, the output from the activity recognizer R(•) is directly used as the final recognition prediction. On EPIC-Kitchens [27], when using both RGB and optical flow, we average the predictions from the two modalities as the final classification result, following prior works [22,27,34].…”

Section: A Implementation Detailsmentioning

confidence: 99%

“…The goal of this paper is to recognize activities such as eating, sleeping or cutting under domain shift caused by change of scenery, camera viewpoint or actor, as shown in Figure 1. Existing solutions align distribution-shifted domains inside a single visual video network by adversarial training [5,20,27,29] and self-supervised learning [9,22,34]. Although successful, projecting the visual features from different source and target domains into a shared space can make the ability of the model to distinguish between classes in the target domain suffer.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Audio-Adaptive Activity Recognition Across Video Domains

Zhang¹,

Doughty²,

Shao³

et al. 2022

Preprint

View full text Add to dashboard Cite

This paper strives for activity recognition under domain shift, for example caused by change of scenery or camera viewpoint. The leading approaches reduce the shift in activity appearance by adversarial training and self-supervised learning. Different from these vision-focused works we leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening. We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation as well as addressing shifts in the semantic distribution. To further eliminate domain-specific features and include domain-invariant activity sounds for recognition, an audio-infused recognizer is proposed, which effectively models the cross-modal interaction across domains. We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically. Experiments on this dataset, EPIC-Kitchens and CharadesEgo show the effectiveness of our approach. Project page: https://xiaobai1217. github.io/DomainAdaptation.

show abstract