Shuffle and Attend: Video Domain Adaptation

Choi, Jin-Hwan; Sharma, Gaurav; Schulter, Samuel; Huang, Jia

doi:10.1007/978-3-030-58610-2_40

Cited by 72 publications

(96 citation statements)

References 48 publications


“…• We conduct extensive experiments on several challenging benchmarks (UCF-HMDB [9], Jester [57], and Epic-Kitchens [54]) for video domain adaptation to demonstrate the superiority of our approach over state-of-the-art methods. Our experiments show that CoMix delivers a significant performance increase over the compared methods, e.g., CoMix outperforms SAVA [12] (ECCV'20) by 3.6% on UCF-HMDB [9] and TA…”

Section: Introduction (mentioning)

confidence: 89%

“…More recently, very few works have attempted deep UDA for video action recognition by directly matching segment-level features [9,28,54,45] or with attention weights [12,57]. However, (1) trivially matching segment-level feature distributions by extending the image-specific approaches, without considering the rich temporal information may not alone be sufficient for video domain adaptation; (2) prior methods often focus on aligning target features with source, rather than exploiting any action semantics shared across both domains (e.g., difference in background with the same action: videos in the top row of Figure 1 are from the source and target domain respectively, but both capture the same action walking); (3) existing methods often rely on complex adversarial learning which is unwieldy to train, resulting in very fragile convergence.…”

Section: Introduction (mentioning)

confidence: 99%

“…To this end, we introduce Contrast and Mix (CoMix), a simple yet effective approach based on contrastive learning to adapt video action recognition models trained on a labeled source domain to unlabelled target domains. First, we propose to represent video as a graph and then utilize temporal contrastive self-supervised learning over the graph representations as a nexus between source and target domains to align features, without requiring any additional adversarial learning, as most prior works do in video domain adaptation [9,12,57]. Specifically, we maximize the similarity between encoded representations of the same video at two different speeds as well as minimize the similarity between different videos played at different speeds, leveraging the fact that changing video speed does not change an action on both domains.…”

Section: Introduction (mentioning)

confidence: 99%


Contrast and Mix: Temporal Contrastive Video Domain Adaptation with Background Mixing

Sahoo, Shah, Panda et al. 2021 (preprint)


Unsupervised domain adaptation which aims to adapt models trained on a labeled source domain to a completely unlabeled target domain has attracted much attention in recent years. While many domain adaptation techniques have been proposed for images, the problem of unsupervised domain adaptation in videos remains largely underexplored. In this paper, we introduce Contrast and Mix (CoMix), a new contrastive learning framework that aims to learn discriminative invariant feature representations for unsupervised video domain adaptation. First, unlike existing methods that rely on adversarial learning for feature alignment, we utilize temporal contrastive learning to bridge the domain gap by maximizing the similarity between encoded representations of an unlabeled video at two different speeds as well as minimizing the similarity between different videos played at different speeds. Second, we propose a novel extension to the temporal contrastive loss by using background mixing that allows additional positives per anchor, thus adapting contrastive learning to leverage action semantics shared across both domains. Moreover, we also integrate a supervised contrastive learning objective using target pseudo-labels to enhance discriminability of the latent space for video domain adaptation. Extensive experiments on several benchmark datasets demonstrate the superiority of our proposed approach over state-of-the-art methods. Project page: https://cvir.github.io/projects/comix.
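The temporal contrastive objective described in this abstract can be illustrated with a generic InfoNCE-style loss. The following is a minimal sketch under stated assumptions, not the CoMix implementation: the function name `temporal_info_nce`, the `temperature` value, and the use of plain NumPy arrays as stand-ins for encoder outputs are all hypothetical, and the background mixing and pseudo-label terms mentioned in the abstract are omitted.

```python
import numpy as np

def temporal_info_nce(fast, slow, temperature=0.1):
    """InfoNCE-style loss over two speed views of the same batch of videos.

    fast[i] and slow[i] are assumed to be embeddings of video i sampled at
    two different playback speeds (the positive pair). Every slow[j] with
    j != i acts as a negative for fast[i].
    """
    # L2-normalise so the dot product is a cosine similarity.
    fast = fast / np.linalg.norm(fast, axis=1, keepdims=True)
    slow = slow / np.linalg.norm(slow, axis=1, keepdims=True)

    # Pairwise similarities, shape (N, N); the diagonal holds the positives.
    logits = fast @ slow.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Maximise the probability assigned to the matching video.
    return float(-np.mean(np.diag(log_prob)))

# Hypothetical embeddings: four videos, each with a distinct direction.
views = np.eye(4)
aligned = temporal_info_nce(views, views)
mismatched = temporal_info_nce(views, np.roll(views, 1, axis=0))
print(aligned < mismatched)
```

The loss is small when the two speed views of each video agree and large when they are paired with the wrong video, which is the behaviour the abstract describes for bridging the two domains without an adversarial term.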


“…Misra et al [40] introduce the idea of learning such visual representations by estimating the order of shuffled video frames. Inspired by the success of this approach, several recent papers focused on designing a novel pretext task using temporal information, such as predicting future frames [13,49,54] or their embeddings [21,27]; estimating the order of frames [10,20,36,40,57] or the direction of video [56]. Another line of research focuses on using temporal coherence [6,24,26,41,62,63] as supervision signal.…”

Section: Related Work (mentioning)

confidence: 99%

Learning to Align Sequential Actions in the Wild

Liu, Tekin, Coskun et al. 2021 (preprint)


State-of-the-art methods for self-supervised sequential action alignment rely on deep networks that find correspondences across videos in time. They either learn frame-to-frame mapping across sequences, which does not leverage temporal information, or assume monotonic alignment between each video pair, which ignores variations in the order of actions. As such, these methods are not able to deal with common real-world scenarios that involve background frames or videos that contain non-monotonic sequence of actions. In this paper, we propose an approach to align sequential actions in the wild that involve diverse temporal variations. To this end, we propose an approach to enforce temporal priors on the optimal transport matrix, which leverages temporal consistency, while allowing for variations in the order of actions. Our model accounts for both monotonic and non-monotonic sequences and handles background frames that should not be aligned. We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning on four different benchmark datasets.


“…Domain Adaptation for Videos. Prior works for video domain adaptation (DA) have focused on classification [6,11,28,42], segmentation [7,8] and localisation [2]. They use adversarial training to align the marginal distributions [28], an auxiliary self-supervised task [8,11,42], or attending to relevant frames alignment [6][7][8].…”

Section: Related Work (mentioning)

confidence: 99%

Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval

Munro, Wray, Larlus et al. 2021 (preprint)


Figure 1: Given video-text pairs (respectively denoted by circles and stars) from the source (blue), and a video-only target set (purple), we propose an alignment method to reduce the domain gap between the source videos and the target videos using pseudo-labels (Section C) and cross-domain ranking (Section 3.2). The learnt and aligned space can then be used for retrieving a ranked list of target videos using previously unseen text queries.
