“…Humans have an innate cognitive ability to draw inferences from different sensory inputs and answer the 5W's and 1H questions (who, what, when, where, why, and how), and replicating this ability in machines has long been a goal of artificial intelligence. In recent years, studies on question answering (QA) have benefited greatly from deep neural networks, showing remarkable performance improvements on textQA [24,30], imageQA [2,3,19,31], and videoQA [8,11,32,34]. This paper considers movie story QA [15,18,21,26,29], which aims at a joint understanding of vision and language by answering questions about movie content and storyline after observing temporally aligned video and subtitles.…”