Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.496

Multimodal Sentence Summarization via Multimodal Selective Encoding

Abstract: This paper studies the problem of generating a summary for a given sentence-image pair. Existing multimodal sequence-to-sequence approaches mainly focus on enhancing the decoder with visual signals, while ignoring that the image can also improve the encoder's ability to identify the highlights of a news event or a document. Thus, we propose a multimodal selective gate network that considers reciprocal relationships between textual and multi-level visual features, including the global image descriptor, activation grids, and object proposals.
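
The abstract names the mechanism but not its form. Purely as an illustration, a selective gate in the spirit of selective encoding, extended with a visual term, might look like the following PyTorch sketch (module names, dimensions, and the fusion itself are assumptions, not the authors' code):

import torch
import torch.nn as nn

class MultimodalSelectiveGate(nn.Module):
    """Hypothetical sketch: gate each encoder state with a sentence-level
    summary vector and a visual feature vector, then filter the states."""
    def __init__(self, hidden_dim, visual_dim):
        super().__init__()
        self.w_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # per-token states
        self.w_s = nn.Linear(hidden_dim, hidden_dim, bias=False)  # sentence vector
        self.w_v = nn.Linear(visual_dim, hidden_dim, bias=True)   # image feature

    def forward(self, enc_states, sent_vec, visual_vec):
        # enc_states: (batch, seq_len, hidden_dim), e.g. BiLSTM outputs
        # sent_vec:   (batch, hidden_dim), e.g. the final encoder state
        # visual_vec: (batch, visual_dim), e.g. a global CNN descriptor
        gate = torch.sigmoid(
            self.w_h(enc_states)
            + self.w_s(sent_vec).unsqueeze(1)
            + self.w_v(visual_vec).unsqueeze(1)
        )
        return enc_states * gate  # filtered states are passed to the decoder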

Cited by 19 publications (9 citation statements)
References 49 publications
“…MMAF and MMCF [55] are modality-based attention mechanisms that pay different kinds of attention to image patches and text units, which are filtered through selective visual information. Considering a selective gate network for reciprocal relationships between textual and multi-level visual features, SELECT [40] is the current SOTA baseline.…”
Section: Methods
Confidence: 99%
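
The quoted passage does not spell out how MMAF/MMCF weight the two modalities. Purely as an illustration of modality-based attention in general (not the actual mechanism of [55]), text units can attend over image patches as follows; all names here are hypothetical:

import torch
import torch.nn.functional as F

def modality_attention(text_units, image_patches):
    # text_units:    (batch, n_text, d)   token representations
    # image_patches: (batch, n_patch, d)  patch features projected to dim d
    scores = torch.bmm(text_units, image_patches.transpose(1, 2))
    attn = F.softmax(scores, dim=-1)             # each text unit attends over patches
    visual_ctx = torch.bmm(attn, image_patches)  # (batch, n_text, d) visual context
    return visual_ctx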
“…Considering visual information as a complement to textual features for generation [7], Zhu et al. [39] propose a multimodal-input, multimodal-output dataset, as well as an attention model that generates a summary through a text-guided mechanism. The SELECT model [40] proposes a selective gate module for integrating reciprocal relationships among multi-level visual features, including a global image descriptor, activation grids, and object proposals. Modeling the correlation among inputs is the core point of MAS.…”
Section: Related Work
Confidence: 99%
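
The three feature levels named in this quote are standard vision outputs. A sketch of how they could be extracted with off-the-shelf torchvision models follows; the model choices and pooling are assumptions, not the paper's exact setup:

import torch
import torchvision

resnet = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # up to the last conv stage
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def visual_features(image):  # image: (3, H, W) float tensor in [0, 1]
    grids = backbone(image.unsqueeze(0))       # activation grids: (1, 2048, H/32, W/32)
    global_vec = grids.mean(dim=(2, 3))        # global image descriptor: (1, 2048)
    proposals = detector([image])[0]["boxes"]  # object proposals: (n_obj, 4) boxes
    return global_vec, grids, proposals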
“…[19] propose a multimodal attention model for How2 videos [25]. [13] design a gate to select event highlights from images and distinguish highlights during encoding. [18] propose a local-global attention mechanism that lets the video and text interact and selects an image as output.…”
Section: Related Work
Confidence: 99%
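
The quote gives no details of [18]'s local-global attention; one rough reading, sketched below with entirely hypothetical names, combines frame-level (local) attention with a pooled (global) video vector, mixed by a text-video gate:

import torch
import torch.nn.functional as F

def local_global_attention(text_q, frame_feats):
    # text_q:      (batch, d)            query vector from the text encoder
    # frame_feats: (batch, n_frames, d)  per-frame video features
    # Local: attend over individual frames with the text query.
    local_scores = torch.bmm(frame_feats, text_q.unsqueeze(2)).squeeze(2)
    local_ctx = torch.bmm(F.softmax(local_scores, dim=-1).unsqueeze(1),
                          frame_feats).squeeze(1)
    # Global: one pooled video vector, weighted by a text-video match gate.
    global_vec = frame_feats.mean(dim=1)
    g = torch.sigmoid((text_q * global_vec).sum(dim=-1, keepdim=True))
    return g * global_vec + (1 - g) * local_ctx  # fused video context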
“…As for text-only summarization, many works [26,4,3] aim to improve word-overlap metrics such as ROUGE. Similar to these works, most MS models also focus on improving the overlap with the source text only [11,19,13], while the visual content is treated as supplemental data and handled independently. However, the visual content may convey information beyond the textual content.…”
Section: Introduction
Confidence: 99%
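
ROUGE itself is simple n-gram overlap; for concreteness, a minimal self-contained ROUGE-1 F1 (not the official ROUGE toolkit) looks like this:

from collections import Counter

def rouge_1_f(candidate, reference):
    """ROUGE-1 F1: unigram overlap between whitespace-tokenized strings."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped match counts
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f("police arrest man after chase",
                "man arrested after police chase"))  # 0.8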