For each video clip, following previous work (Sanabria et al., 2018; Palaskar et al., 2019; Khullar and Arora, 2020), a 2048-dimensional feature representation is extracted for every 16 non-overlapping frames using a 3D ResNeXt-101 model (Hara et al., 2018) pre-trained on the Kinetics dataset (Kay et al., 2017). Each data sample therefore carries a sequence of 2048-dimensional vision feature vectors whose length equals the number of 16-frame chunks in the clip.
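As a minimal sketch of this preprocessing step, the snippet below chunks a video into non-overlapping 16-frame segments and runs each segment through a 3D CNN. It assumes a helper `load_resnext101_3d` (hypothetical, not part of the cited works) that returns Hara et al.'s Kinetics-pretrained 3D ResNeXt-101 with its classification head removed, so each forward pass yields one 2048-dimensional vector; tensor shapes are illustrative.

```python
import torch

CHUNK = 16  # non-overlapping frames per feature vector

def extract_features(frames: torch.Tensor, model: torch.nn.Module) -> torch.Tensor:
    """frames: (T, C, H, W) video tensor; returns (T // CHUNK, 2048).

    `model` is assumed to be a 3D ResNeXt-101 backbone (head removed)
    that maps a (1, C, 16, H, W) clip to a (1, 2048) feature vector.
    """
    model.eval()
    n_chunks = frames.shape[0] // CHUNK  # drop any trailing partial chunk
    feats = []
    with torch.no_grad():
        for i in range(n_chunks):
            chunk = frames[i * CHUNK:(i + 1) * CHUNK]      # (16, C, H, W)
            clip = chunk.permute(1, 0, 2, 3).unsqueeze(0)  # (1, C, 16, H, W)
            feats.append(model(clip).squeeze(0))           # (2048,)
    return torch.stack(feats)  # the sequence of 2048-d vision feature vectors

# Usage (hypothetical loader; e.g. weights from Hara et al.'s released models):
# model = load_resnext101_3d(pretrained="kinetics", remove_head=True)
# video_feats = extract_features(frames, model)
```

The resulting tensor is the per-clip feature sequence described above, with one 2048-dimensional vector per 16-frame segment.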