Multi-modal Sentence Summarization with Modality Attention and Image Filtering

Multimodal summarization has drawn much attention due to the rapid growth of multimedia data. The output of current multimodal summarization systems is usually text only. However, our experiments show that multimodal output can significantly improve user satisfaction with the informativeness of summaries. In this paper, we propose a novel task, multimodal summarization with multimodal output (MSMO). To handle this task, we first collect a large-scale dataset for MSMO research. We then propose a multimodal attention model to jointly generate text and select the most relevant image from the multimodal input. Finally, to evaluate multimodal outputs, we construct a novel multimodal automatic evaluation (MMAE) method which considers both intramodality salience and intermodality relevance. The experimental results show the effectiveness of MMAE.

“…Multimodal Attention. To fuse the text and visual context information, we add a multimodal attention layer (Li et al, 2018a), as shown in Fig. 2.…”

Section: Multimodal Attention Model (mentioning)

confidence: 99%
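
The snippet above mentions fusing text and visual context information through a multimodal attention layer. As a rough illustration only, the sketch below weights two context vectors by attention over modalities and sums them. The dot-product scoring and the names `modality_attention`, `decoder_state`, `text_context` and `image_context` are assumptions made for this example; the cited papers' actual parameterization is not given on this page.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def modality_attention(decoder_state, text_context, image_context):
    """Fuse a text and an image context vector with modality-level attention.

    Assumption: each modality is scored by a plain dot product with the
    decoder state. A learned scoring function would replace this in practice.
    """
    scores = [dot(decoder_state, text_context), dot(decoder_state, image_context)]
    weights = softmax(scores)
    fused = [
        weights[0] * t + weights[1] * i
        for t, i in zip(text_context, image_context)
    ]
    return fused, weights


# Toy inputs: the decoder state aligns with the text context,
# so the text modality should receive the larger weight.
state = [1.0, 0.0]
fused, weights = modality_attention(state, [2.0, 0.0], [0.0, 2.0])
```

With these toy inputs the text modality gets the larger weight because its context vector is aligned with the decoder state. The two weights always sum to one, so the fused vector is a convex combination of the two contexts.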

MSMO: Multimodal Summarization with Multimodal Output

Zhu, Li, Liu et al. 2018

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

“…Video summarization [17,28,30] is also a major sub-domain of multi-modal summarization. A few deep learning frameworks [2,11,31] show promising results, too. Li et al [12] uses an asynchronous dataset containing text, images and videos to generate a textual summary.…”

Section: Related Work (mentioning)

confidence: 99%

“…Summarization can help tackle this problem by distilling the most significant information from the plethora of available content. Recent research in summarization [2,11,31] has proven that having multi-modal data can improve the quality of summary in comparison to uni-modal summaries. Multi-modal information can help users gain deeper insights.…”

Section: Introduction (mentioning)

confidence: 99%

“…Although visual representation of information is more expressive and comprehensive in comparison to textual description of the same information, it is still not a thorough model of representation. Encoding abstract concepts like guilt or freedom [11], geographical locations or environmental features like temperature, humidity etc. via images is impractical.…”

Section: Introduction (mentioning)

confidence: 99%

Text-Image-Video Summary Generation Using Joint Integer Linear Programming

Jangra, Jatowt, Hasanuzzaman et al. 2020

Lecture Notes in Computer Science

Automatically generating a summary for asynchronous data can help users to keep up with the rapid growth of multi-modal information on the Internet. However, the current multi-modal systems usually generate summaries composed of text and images. In this paper, we propose a novel research problem of text-image-video summary generation (TIVS). We first develop a multi-modal dataset containing text documents, images and videos. We then propose a novel joint integer linear programming multi-modal summarization (JILP-MMS) framework. We report the performance of our model on the developed dataset.

“…Inspired by the above observations, we propose a model called Attribute-aware Sequence Network (ASN) to consider attribute information into review summarization. Specifically, ASN is based on sequence to sequence models (S2S), which are popular methods in text summarization (Rush et al, 2015;See et al, 2017;Li et al, 2018a) and review summarization (Wang and Ling, 2016;Ma et al, 2018). ASN updates over standard S2S are three-fold.…”

Section: Introduction (mentioning)

confidence: 99%

Attribute-aware Sequence Network for Review Summarization

Li, Wang, Yin et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

Review summarization aims to generate a condensed summary for a review or multiple reviews. Existing review summarization systems mainly generate summaries based only on review content and neglect the authors' attributes (e.g., gender, age, and occupation). In fact, when summarizing a review, users with different attributes usually pay attention to specific aspects and have their own word-using habits or writing styles. Therefore, we propose an Attribute-aware Sequence Network (ASN) to take the aforementioned users' characteristics into account, which includes three modules: an attribute encoder that encodes the attribute preferences over the words; an attribute-aware review encoder that adopts an attribute-based selective mechanism to select the important information of a review; and an attribute-aware summary decoder that incorporates attribute embedding and attribute-specific word-using habits into word prediction. To validate our model, we collect a new dataset, TripAtt, comprising 495,440 attribute-review-summary triplets with three kinds of attribute information: gender, age, and travel status. Extensive experiments show that ASN achieves state-of-the-art performance on review summarization in both the automatic ROUGE metric and human evaluation.

Multi-modal Sentence Summarization with Modality Attention and Image Filtering

Cited by 56 publications

References 8 publications
