“…This advancement stems in part from the success of multi-modal pretraining on web-scale vision-text data [8,21,31,34,38,44,52,53,54,63], and in part from the emergence of a unified deep neural network, the transformer [55], that can model both vision and natural language data well. As a typical multi-disciplinary AI task, Video Question Answering (VideoQA) has benefited greatly from these developments, which have propelled the field steadily forward beyond purely conventional techniques [14,16,20,23,28,60,71].…”