“…Hence, it is widely explored in both the supervised [20,35,48,49,63,70] and self-supervised [28,29,34,36,39] paradigms. Self-supervised approaches learn temporal modelling by solving various pretext tasks, such as dense future prediction [28,29], jigsaw puzzle solving [36,39], and pseudo motion classification [34]. Supervised video recognition explores various connections between different frames, such as 3D convolutions [62], temporal convolution [63], and temporal shift [48].…”