“…Misra et al [40] introduce the idea of learning such visual representations by estimating the order of shuffled video frames. Inspired by the success of this approach, several recent papers focused on designing a novel pretext task using temporal information, such as predicting future frames [13,49,54] or their embeddings [21,27]; estimating the order of frames [10,20,36,40,57] or the direction of video [56]. Another line of research focuses on using temporal coherence [6,24,26,41,62,63] as supervision signal.…”