“…Unsupervised learning in videos has followed a similar trajectory with earlier methods focusing on predictive tasks based on motion, color and spatiotemporal ordering [29,43,1,44,78,85,60,84,58,57,21,51,86,66,22,48,91,16,87,70,45], and contrastive objectives with visual [74,79,34,53,28,92] and audio-visual input [65,4,5,49,3,68,69].…”