Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1659

Multimodal Abstractive Summarization for How2 Videos

Abstract: In this paper, we study abstractive summarization for open-domain videos. Unlike traditional text news summarization, the goal is less to "compress" text information than to provide a fluent textual summary of information that has been collected and fused from different source modalities, in our case video and audio transcripts (or text). We show how a multi-source sequence-to-sequence model with hierarchical attention can integrate information from different modalities into a coherent output, compar…
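The abstract's "multi-source sequence-to-sequence model with hierarchical attention" combines the modalities in two steps: a separate attention over each encoder's states, followed by a second attention over the resulting per-modality context vectors. The PyTorch sketch below is a minimal illustration of that two-level scheme; the class names, dimensions, and the assumption of a shared context dimension are ours, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAttention(nn.Module):
    # Standard additive attention over one encoder's state sequence.
    def __init__(self, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.proj_dec = nn.Linear(dec_dim, att_dim)
        self.proj_enc = nn.Linear(enc_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        energy = self.score(torch.tanh(
            self.proj_dec(dec_state).unsqueeze(1) + self.proj_enc(enc_states)))
        weights = F.softmax(energy, dim=1)            # (batch, src_len, 1)
        return (weights * enc_states).sum(dim=1)      # per-modality context vector

class HierarchicalFusion(nn.Module):
    # Second-level attention that weighs the per-modality context vectors
    # (here: transcript text and video features, projected to a common size).
    def __init__(self, dec_dim, ctx_dim, att_dim):
        super().__init__()
        self.text_att = ModalityAttention(dec_dim, ctx_dim, att_dim)
        self.video_att = ModalityAttention(dec_dim, ctx_dim, att_dim)
        self.proj_dec = nn.Linear(dec_dim, att_dim)
        self.proj_ctx = nn.Linear(ctx_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, dec_state, text_states, video_states):
        contexts = torch.stack([
            self.text_att(dec_state, text_states),
            self.video_att(dec_state, video_states),
        ], dim=1)                                     # (batch, 2, ctx_dim)
        energy = self.score(torch.tanh(
            self.proj_dec(dec_state).unsqueeze(1) + self.proj_ctx(contexts)))
        beta = F.softmax(energy, dim=1)               # modality weights
        return (beta * contexts).sum(dim=1)           # fused context for the decoder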

Cited by 62 publications (45 citation statements)
References 43 publications (39 reference statements)
“…Multimodal Summarization. A series of works (Palaskar et al., 2019; Chan et al., 2019; Gao et al., 2020a) focused on generating better textual summaries with the help of multimodal input. Multimodal summarization with multimodal output is relatively less explored.…”
Section: Related Work (mentioning)
confidence: 99%
“…How2: a model proposed to generate a textual summary with video information (Palaskar et al., 2019). Synergistic: an image-question-answer synergistic network to value the role of the answer for precise visual dialog (Guo et al., 2019).…”
Section: Multimodal Baselines (mentioning)
confidence: 99%
“…VideoRNN (Palaskar et al., 2019): a video-only baseline model implemented on the How2 dataset.…”
Section: Baseline Models (mentioning)
confidence: 99%
“…Existing approaches have obtained promising results. For example, Libovický et al. (2018) and Palaskar et al. (2019) utilize multiple encoders to encode videos and audio transcripts and a joint decoder to decode the multi-source encodings, which achieves better performance than single-modality structures. Despite the effectiveness of these approaches, they only perform multimodal fusion during the decoding stage to generate a target sequence, lacking fine-grained interactions between multi-source inputs to complete the missing information of each modality.…”
Section: Introduction (mentioning)
confidence: 99%
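The quote above describes the "multiple encoders, joint decoder" setup. As a follow-up to the sketch given after the abstract, the toy snippet below shows that setup in use: transcript and video encoders each produce their own state sequence, and the fused context from the hierarchical attention is what the joint decoder consumes at each step. The shapes and variable names are illustrative assumptions only.

import torch

# Reuses the HierarchicalFusion sketch defined after the abstract above.
batch, txt_len, vid_len, dim = 4, 50, 20, 256
text_states  = torch.randn(batch, txt_len, dim)   # transcript encoder outputs
video_states = torch.randn(batch, vid_len, dim)   # video-feature encoder outputs
dec_state    = torch.randn(batch, dim)            # current decoder hidden state

fusion = HierarchicalFusion(dec_dim=dim, ctx_dim=dim, att_dim=128)
fused_context = fusion(dec_state, text_states, video_states)
print(fused_context.shape)  # torch.Size([4, 256]); fed to the decoder's output projection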
“…In particular, Ramanishka et al. (2016) … However, we suspect that the instructional video domain is significantly different from MSR-VTT (where the audio information does not necessarily correspond to human speech), as we find that ASR-only models significantly surpass the state-of-the-art video model in our case. Palaskar et al. (2019) and Shi et al. (2019), contemporaneous with the submission of the present work, also examine ASR as a source of signal for generating how-to video captions.…”
Section: Related Work (mentioning)
confidence: 99%