“…Multi-modal Video Captioning. Video naturally has multi-modal inputs, including visual, speech text, and audio. Previous works explore visual RGB, motion, and optical flow features, audio features (Hori et al., 2017; Wang et al., 2018b; Rahman et al., 2019), and speech text features (Shi et al., 2019; Hessel et al., 2019; Iashin and Rahtu, 2020b) for captioning. According to Shi et al. (2019), Hessel et al. (2019), and Iashin and Rahtu (2020b), although speech text is noisy and informal, it can still capture better semantic features and improve performance, especially for instructional videos.…”