“…For example, TVQA [15] presents a large-scale dataset and a model that leverages Faster R-CNN [21] and LSTMs [12] to process visual and language inputs, and attention mechanisms [24] have also achieved great success [30,34,35]. Recently, a new research direction in VideoQA has emerged, namely external knowledge-based VideoQA [7,8], which requires information that cannot be directly obtained from the videos or the question-answer (QA) pairs and thus cannot be learned from the dataset. A model for this task must therefore draw on knowledge from external sources.…”