Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.496

Multimodal Sentence Summarization via Multimodal Selective Encoding

Abstract: This paper studies the problem of generating a summary for a given sentence-image pair. Existing multimodal sequence-to-sequence approaches mainly focus on enhancing the decoder with visual signals, while ignoring that the image can also improve the encoder's ability to identify the highlights of a news event or a document. Thus, we propose a multimodal selective gate network that considers reciprocal relationships between textual and multi-level visual features, including the global image descriptor, activation grids, and object proposals.
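
The abstract names the mechanism but not its form. Purely as an illustration, a selective gate in the spirit of selective encoding, extended with a visual term, might look like the following PyTorch sketch (module names, dimensions, and the fusion itself are assumptions, not the authors' code):

import torch
import torch.nn as nn

class MultimodalSelectiveGate(nn.Module):
    """Hypothetical sketch: gate each encoder state with a sentence-level
    summary vector and a visual feature vector, then filter the states."""
    def __init__(self, hidden_dim, visual_dim):
        super().__init__()
        self.w_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # per-token states
        self.w_s = nn.Linear(hidden_dim, hidden_dim, bias=False)  # sentence vector
        self.w_v = nn.Linear(visual_dim, hidden_dim, bias=True)   # image feature

    def forward(self, enc_states, sent_vec, visual_vec):
        # enc_states: (batch, seq_len, hidden_dim), e.g. BiLSTM outputs
        # sent_vec:   (batch, hidden_dim), e.g. the final encoder state
        # visual_vec: (batch, visual_dim), e.g. a global CNN descriptor
        gate = torch.sigmoid(
            self.w_h(enc_states)
            + self.w_s(sent_vec).unsqueeze(1)
            + self.w_v(visual_vec).unsqueeze(1)
        )
        return enc_states * gate  # filtered states are passed to the decoder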

Cited by 19 publications (9 citation statements)
References 49 publications
“…MMAF and MMCF [55] are modality-based attention mechanisms that pay different kinds of attention to image patches and text units, which are filtered through selective visual information. Considering a selective gate network for reciprocal relationships between textual and multi-level visual features, SELECT [40] is the current SOTA baseline.…”
Section: Methods
Confidence: 99%
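
The quoted passage does not spell out how MMAF/MMCF weight the two modalities. Purely as an illustration of modality-based attention in general (not the actual mechanism of [55]), text units can attend over image patches as follows; all names here are hypothetical:

import torch
import torch.nn.functional as F

def modality_attention(text_units, image_patches):
    # text_units:    (batch, n_text, d)   token representations
    # image_patches: (batch, n_patch, d)  patch features projected to dim d
    scores = torch.bmm(text_units, image_patches.transpose(1, 2))
    attn = F.softmax(scores, dim=-1)             # each text unit attends over patches
    visual_ctx = torch.bmm(attn, image_patches)  # (batch, n_text, d) visual context
    return visual_ctx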
“…Considering visual information as a complement to textual features for generation [7], Zhu et al. [39] propose a multimodal-input, multimodal-output dataset, as well as an attention model that generates a summary through a text-guided mechanism. The SELECT model [40] proposes a selective gate module for integrating reciprocal relationships among multi-level visual features, including a global image descriptor, activation grids, and object proposals. Modeling the correlation among inputs is the core point of MAS.…”
Section: Related Work
Confidence: 99%
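
The three feature levels named in this quote are standard vision outputs. A sketch of how they could be extracted with off-the-shelf torchvision models follows; the model choices and pooling are assumptions, not the paper's exact setup:

import torch
import torchvision

resnet = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # up to the last conv stage
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def visual_features(image):  # image: (3, H, W) float tensor in [0, 1]
    grids = backbone(image.unsqueeze(0))       # activation grids: (1, 2048, H/32, W/32)
    global_vec = grids.mean(dim=(2, 3))        # global image descriptor: (1, 2048)
    proposals = detector([image])[0]["boxes"]  # object proposals: (n_obj, 4) boxes
    return global_vec, grids, proposals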
“…[19] propose a multimodal attention model for How2 videos [25]. [13] design a gate to select event highlights from images and distinguish highlights during encoding. [18] propose a local-global attention mechanism that lets the video and text interact and selects an image as output.…”
Section: Related Work
Confidence: 99%
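
The quote gives no details of [18]'s local-global attention; one rough reading, sketched below with entirely hypothetical names, combines frame-level (local) attention with a pooled (global) video vector, mixed by a text-video gate:

import torch
import torch.nn.functional as F

def local_global_attention(text_q, frame_feats):
    # text_q:      (batch, d)            query vector from the text encoder
    # frame_feats: (batch, n_frames, d)  per-frame video features
    # Local: attend over individual frames with the text query.
    local_scores = torch.bmm(frame_feats, text_q.unsqueeze(2)).squeeze(2)
    local_ctx = torch.bmm(F.softmax(local_scores, dim=-1).unsqueeze(1),
                          frame_feats).squeeze(1)
    # Global: one pooled video vector, weighted by a text-video match gate.
    global_vec = frame_feats.mean(dim=1)
    g = torch.sigmoid((text_q * global_vec).sum(dim=-1, keepdim=True))
    return g * global_vec + (1 - g) * local_ctx  # fused video context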
“…As for text-only summarization, many works [26,4,3] aim to improve word-overlap metrics such as ROUGE. Similar to these works, most MS models also focus on improving the overlap with the source text only [11,19,13], while the visual content is treated as supplemental data and handled independently. However, the visual content may convey information beyond the textual content.…”
Section: Introduction
Confidence: 99%
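
ROUGE itself is simple n-gram overlap; for concreteness, a minimal self-contained ROUGE-1 F1 (not the official ROUGE toolkit) looks like this:

from collections import Counter

def rouge_1_f(candidate, reference):
    """ROUGE-1 F1: unigram overlap between whitespace-tokenized strings."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped match counts
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f("police arrest man after chase",
                "man arrested after police chase"))  # 0.8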