2020
DOI: 10.1609/aaai.v34i05.6525
Multimodal Summarization with Guidance of Multimodal Reference

Abstract: Multimodal summarization with multimodal output (MSMO) aims to generate a multimodal summary for a multimodal news report, which has been proven to effectively improve user satisfaction. Existing MSMO methods are trained with a text-modality objective only, leading to a modality-bias problem in which the quality of the model-selected image is ignored during training. To alleviate this problem, we propose a multimodal objective function with the guidance of a multimodal reference, using the loss from the summary generation…
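The abstract describes combining the summary-generation loss with an image-selection loss into one multimodal objective. A minimal sketch of such a combined objective, assuming both terms are cross-entropy losses and using a hypothetical weighting factor `lam` (the paper's exact formulation is truncated above and not reproduced here):

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class under a probability distribution."""
    return -math.log(probs[target_idx])

def multimodal_loss(summary_token_probs, summary_targets,
                    image_probs, image_target, lam=1.0):
    """Hypothetical combined objective: average per-token summary loss
    plus a weighted image-selection loss over the candidate images."""
    text_loss = sum(
        cross_entropy(p, t)
        for p, t in zip(summary_token_probs, summary_targets)
    ) / len(summary_targets)
    image_loss = cross_entropy(image_probs, image_target)
    return text_loss + lam * image_loss
```

With `lam = 0` this reduces to the text-only objective the abstract criticizes; a nonzero `lam` is what lets the image-selection quality influence training.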

Cited by 66 publications (66 citation statements)
References 20 publications (36 reference statements)
“…proposed to jointly generate a textual summary and select the most relevant image from 6 candidates. Following their work, Zhu et al. (2020) added a multimodal objective function that uses the loss from both the textual summary generation and the image selection. However, in real-world applications, we usually need to choose the cover figure for a continuous video consisting of hundreds of frames.…”
Section: Related Work
confidence: 99%
“…MSMO: the first model on the multi-output task, which attends to text and images while generating the textual summary and uses coverage to help select the picture. MOF: a model based on MSMO that adds image accuracy as an additional loss (Zhu et al., 2020).…”
Section: Multimodal Baselines
confidence: 99%
“…Narayan et al (2017) develop extractive summarization with side information including images and captions. , Chen and Zhuge (2018) and Zhu et al (2020) propose to generate multimodal summary for multimodal news document. Li et al (2018a) first introduce the multimodal sentence summarization task, and they propose a hierarchical attention model, which can pay different attention to image patches, words, and different modalities when decoding target words.…”
Section: Multimodal Seq2seq Modelsmentioning
confidence: 99%
“…According to different tasks, the input modalities also differ, such as text+image (Wang et al., 2012; Bian et al., 2013, 2014; Wang et al., 2016) and video+audio+text (Evangelopoulos et al., 2013; Li et al., 2017), which mainly focus on extractive approaches. With the popularity of sequence-to-sequence learning (Sutskever et al., 2014), the use of corpora with human-written summaries for multimodal abstractive summarization has attracted interest (Zhu et al., 2020).…”
Section: Related Work
confidence: 99%