2020
DOI: 10.1609/aaai.v34i05.6525
Multimodal Summarization with Guidance of Multimodal Reference

Abstract: Multimodal summarization with multimodal output (MSMO) aims to generate a multimodal summary for a multimodal news report, which has been proven to effectively improve user satisfaction. Existing MSMO methods are trained with a text-modality objective only, leading to a modality-bias problem in which the quality of the model-selected image is ignored during training. To alleviate this problem, we propose a multimodal objective function with the guidance of a multimodal reference, using the loss from the summary generation…
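The abstract describes combining the summary-generation loss with an image-selection loss into one multimodal objective. A minimal sketch of such a combined objective, assuming both terms are cross-entropy losses and using a hypothetical weighting factor `lam` (the paper's exact formulation is truncated above and not reproduced here):

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class under a probability distribution."""
    return -math.log(probs[target_idx])

def multimodal_loss(summary_token_probs, summary_targets,
                    image_probs, image_target, lam=1.0):
    """Hypothetical combined objective: average per-token summary loss
    plus a weighted image-selection loss over the candidate images."""
    text_loss = sum(
        cross_entropy(p, t)
        for p, t in zip(summary_token_probs, summary_targets)
    ) / len(summary_targets)
    image_loss = cross_entropy(image_probs, image_target)
    return text_loss + lam * image_loss
```

With `lam = 0` this reduces to the text-only objective the abstract criticizes; a nonzero `lam` is what lets the image-selection quality influence training.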

Cited by 66 publications (66 citation statements)
References 20 publications (36 reference statements)
“…proposed to jointly generate a textual summary and select the most relevant image from 6 candidates. Following their work, Zhu et al. (2020) added a multimodal objective function that uses the loss from both the textual summary generation and the image selection. However, in real-world applications, we usually need to choose the cover figure for a continuous video consisting of hundreds of frames.…”
Section: Related Work
confidence: 99%
“…MSMO: the first model on the multi-output task, which attends to text and images while generating the textual summary and uses coverage to help select the picture. MOF: a model based on MSMO that adds image accuracy as an additional loss (Zhu et al., 2020).…”
Section: Multimodal Baselines
confidence: 99%
“…Narayan et al (2017) develop extractive summarization with side information including images and captions. , Chen and Zhuge (2018) and Zhu et al (2020) propose to generate multimodal summary for multimodal news document. Li et al (2018a) first introduce the multimodal sentence summarization task, and they propose a hierarchical attention model, which can pay different attention to image patches, words, and different modalities when decoding target words.…”
Section: Multimodal Seq2seq Modelsmentioning
confidence: 99%
“…According to different tasks, the input modalities also differ, such as text+image (Wang et al., 2012; Bian et al., 2013, 2014; Wang et al., 2016) and video+audio+text (Evangelopoulos et al., 2013; Li et al., 2017), which mainly focus on extractive approaches. With the popularity of sequence-to-sequence learning (Sutskever et al., 2014), the use of corpora with human-written summaries for multimodal abstractive summarization has attracted interest (Zhu et al., 2020).…”
Section: Related Work
confidence: 99%