“…In self-supervised video representation learning, one line of work designs pretext tasks such as temporal ordering [46,74,75], spatiotemporal puzzles [33,63], colorization [59], playback speed prediction [31,6], and temporal cycle-consistency [66,30,37]. Other works learn feature embeddings by predicting future frames from a given sequence [58,57,43,5].…”