“…In addition, most methods sample the input video densely; e.g., HCRN [Le et al, 2020] and Bridge2Answer [Park et al, 2021] sample 8 clips of 16 frames each, whereas our method with scale N set to 3 samples only 7 clips and therefore incurs a lower computational load.

… [Fan et al, 2019]                73.9  77.8  53.8  4.02
FAM [Cai et al, 2020]              75.4  79.2  56.9  3.79
L-GCN [Huang et al, 2020]          74.3  81.1  56.3  3.95
HGA [Jiang and Han, 2020]          75.4  81.0  55.1  4.09
HCRN [Le et al, 2020]              75.0  81.4  55.9  3.82
Bridge2Answer [Park et al, 2021]   75.9  82.6  57.5  3.71
HOSTR [Dang et al, 2021]           75 …

… et al, 2018]                     31.7  31.9
HME [Fan et al, 2019]              33.7  33.0
FAM [Cai et al, 2020]              34.5  33.2
HGA [Jiang and Han, 2020]          34.7  35.5
HCRN [Le et al, 2020]              36.1  35.6
Bridge2Answer [Park et al, 2021]   37.2  36.9
HOSTR [Dang et al, 2021]           39.4  35.9

We further compare against these methods on the MSVD-QA and MSRVTT-QA datasets; the results are reported in Table 2.…”
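The sampling comparison in the excerpt can be checked with a little arithmetic. The sketch below is illustrative only: the excerpt states that scale N = 3 yields 7 clips but does not say how, so the rule that scale n contributes 2^(n-1) clips (1 + 2 + 4 = 7) is an assumption, and the function names are hypothetical. The dense baseline figure (8 clips of 16 frames) is taken from the text.

```python
def pyramid_clip_count(num_scales: int) -> int:
    """Total clips over all scales.

    ASSUMPTION (not stated in the excerpt): scale n, counted from 1,
    contributes 2**(n - 1) clips, so the total is 2**num_scales - 1.
    """
    return sum(2 ** (n - 1) for n in range(1, num_scales + 1))


def dense_frame_count(num_clips: int, frames_per_clip: int) -> int:
    """Frames processed by a dense sampler with fixed-length clips."""
    return num_clips * frames_per_clip


# Clip count for N = 3 under the assumed doubling rule.
clips_multiscale = pyramid_clip_count(3)

# Dense baseline described in the text: 8 clips of 16 frames each.
frames_dense = dense_frame_count(8, 16)

print(f"multi-scale clips (N=3): {clips_multiscale}")
print(f"dense baseline frames:   {frames_dense}")
```

The excerpt does not give the number of frames per clip for the multi-scale method, so only the clip counts (7 versus 8) can be compared directly here.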