2019
DOI: 10.1007/978-3-030-11012-3_45
Learning Spatiotemporal 3D Convolution with Video Order Self-supervision

Cited by 9 publications (4 citation statements) · References 7 publications
“…Pretext task based approaches: The work by Misra et al [35] was one of the first attempts at self-supervised video representation learning by learning to verify correct frame order. Other pretext tasks based on correct temporal order learning include: identifying the correctly ordered tuple from a set of shuffled orderings [13,41], sorting frame order [29], and predicting clip order [52]. There are some methods which extend pretext tasks from the image domain to video domain, for example, solving spatiotemporal jigsaw puzzles [3,26,4], identifying the rotation of transformed video clips [22].…”
Section: Related Work (mentioning)
confidence: 99%
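The order-verification pretext tasks described above (Misra et al.-style frame-order verification and shuffled-tuple identification) can be sketched as a simple sampling routine: draw a tuple of frame indices, optionally shuffle it, and use "is the tuple in temporal order?" as the self-supervised label. This is an illustrative sketch only; the function name and signature are hypothetical, not taken from any cited paper's released code.

```python
import random

def make_order_sample(num_frames, tuple_len=3, seed=None):
    """Sample frame indices for an order-verification pretext task.

    Returns (indices, label): label 1 if the tuple preserves temporal
    order, 0 if it was shuffled.  Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    # Draw distinct frame indices and put them in temporal order.
    idx = sorted(rng.sample(range(num_frames), tuple_len))
    label = rng.randint(0, 1)
    if label == 0:
        # Negative example: shuffle until the order actually changes.
        shuffled = idx[:]
        while shuffled == idx:
            rng.shuffle(shuffled)
        idx = shuffled
    return idx, label
```

A model trained on such samples must attend to temporal dynamics to tell ordered from shuffled tuples, which is the supervision signal these works exploit.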
“…While modeling time remains a challenge, it also presents a natural source of supervision that has been exploited for self-supervised learning. For example, as a proxy signal by posing pretext tasks involving spatio-temporal jigsaw [1,43,52], video speed [10,16,47,94,109,123], arrow of time [78,80,112], frame/clip ordering [24,70,90,97,116], video continuity [60], or tracking [44,106,111]. Several works have also used contrastive learning to obtain spatio-temporal representations by (i) contrasting temporally augmented versions of a clip [46,77,81], or (ii) encouraging consistency between local and global temporal contexts [9,17,85,122].…”
Section: Time In Vision (mentioning)
confidence: 99%
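Among the proxy signals listed above, the video-speed pretext task has a particularly compact form: subsample a clip at one of several temporal strides and ask the network to classify which playback speed was used. A minimal sketch, assuming a list of frame identifiers as input; the function name and parameters are assumptions for illustration, not code from the cited papers.

```python
import random

def speed_pretext_clip(frame_ids, clip_len=8, speeds=(1, 2, 4)):
    """Build one training example for a playback-speed pretext task.

    Picks a speed (temporal stride), takes clip_len frames at that
    stride, and returns (clip, speed_index); the speed index serves
    as the classification label.  Illustrative sketch only.
    """
    speed_idx = random.randrange(len(speeds))
    stride = speeds[speed_idx]
    # Choose a start offset so the strided clip fits in the video.
    max_start = len(frame_ids) - clip_len * stride
    start = random.randrange(max_start + 1)
    clip = frame_ids[start : start + clip_len * stride : stride]
    return clip, speed_idx
```

Because faster strides skip more motion between consecutive frames, solving this classification forces the representation to encode temporal dynamics, the same intuition behind the arrow-of-time and frame-ordering variants.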
“…These tasks apply multiple transformations to source videos for the model to recognize, and have been shown to be effective in self-supervised representation learning (Wang et al 2021b). Examples include identifying the temporal order of shuffled clips or frames (Lee et al 2017;Xu et al 2019;Suzuki et al 2018), predicting a video's playback rate (Benaim et al 2020;Wang, Jiao, and Liu 2020;Chen et al 2021) or motion and appearance statistics (Wang et al 2019), identifying the rotation angle of video clips (Jing et al 2018), or solving spatiotemporal jigsaw puzzles (Ahsan, Madhok, and Essa 2019;Kim, Cho, and Kweon 2019), etc. In this work, we focus on an essential yet less-touched video property, the video continuity.…”
Section: Related Work (mentioning)
confidence: 99%