Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413961

Context-Aware Multi-View Summarization Network for Image-Text Matching

Abstract: Image-text matching is a vital yet challenging task in multimedia analysis. Over the past decades, great efforts have been made to bridge the semantic gap between the visual and textual modalities. Despite their significance and value, most prior works are still confronted with the multi-view description challenge, i.e., how to align an image with multiple textual descriptions that exhibit semantic diversity. Toward this end, we present a novel context-aware multi-view summarization network to summarize context-en…


Cited by 77 publications (28 citation statements) | References 34 publications (80 reference statements)
“…Among the region-phrase-based methods, Niu et al [21] proposed a cross-modal attention model to align features from the two modalities at the global-to-global, global-to-local, and local-to-local levels in order to extract multi-granular features. However, these works require cross-modal operations for each image-text pair, which introduces a high computational cost [24]. Recently, Wang et al [32] proposed an approach that is free from cross-modal operations.…”
Section: Related Work
confidence: 99%
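The statement above concerns region-phrase alignment via cross-modal attention and its per-pair cost. The sketch below is a generic, illustrative text-to-image stacked-attention matching score in PyTorch; the function name, temperature value, and feature shapes are assumptions rather than any cited paper's implementation. It also shows why such methods are expensive: the whole computation has to be repeated for every candidate image-text pair.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention_score(regions, words, temperature=9.0):
    """Illustrative text-to-image stacked attention: each word attends over
    image regions, and the matching score averages the word-context
    similarities. regions: (n_regions, d); words: (n_words, d)."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    affinity = words @ regions.t()                     # word-region affinities (n_w, n_r)
    attn = F.softmax(temperature * affinity, dim=-1)   # each word attends over the regions
    attended = attn @ regions                          # attended visual context per word
    word_scores = F.cosine_similarity(words, attended, dim=-1)
    return word_scores.mean()                          # scalar image-text matching score

# Toy usage: 36 region features and 12 word features, both 1024-d (illustrative shapes).
score = cross_modal_attention_score(torch.randn(36, 1024), torch.randn(12, 1024))
```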
“…One popular cross-modal alignment strategy involves adopting attention models to acquire correspondences between body parts and words [17,16,2]. However, this strategy depends on cross-modal operations for each image-text pair, which are computationally expensive [24]. Another intuitive strategy involves splitting one textual description into several groups of noun phrases by using external tools, e.g.…”
Section: Introduction
confidence: 99%
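The quote above is cut off at "e.g.", so the external tool it names is not visible here. One commonly used option for splitting a description into noun phrases is spaCy; the following is a minimal sketch under that assumption.

```python
# Minimal noun-phrase splitting sketch using spaCy as the external tool
# (the specific tool intended in the truncated quote is an assumption).
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A young woman in a red coat walks a small dog along the beach.")
noun_phrases = [chunk.text for chunk in doc.noun_chunks]
print(noun_phrases)  # e.g. ['A young woman', 'a red coat', 'a small dog', 'the beach']
```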
“…Different from visual representation, text representation varies little across methods. Most methods use the powerful pretrained language model BERT [5] to obtain text representations, and some methods [6,8,17,19,24,28] also use a GRU [2,31].…”
Section: Textual Representations
confidence: 99%
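As a rough illustration of the two textual encoders mentioned in the statement above, the sketch below extracts sentence features with a pretrained BERT (via the Hugging Face transformers library) and, alternatively, with a bidirectional GRU over word embeddings. The dimensions and the reuse of BERT's tokenizer for the GRU branch are arbitrary choices for the sketch, not those of any cited method.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

sentence = "Two dogs are playing in the snow."

# BERT-based text representation: contextual features for every token.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    bert_feats = bert(**inputs).last_hidden_state     # (1, n_tokens, 768)

# GRU-based alternative over (randomly initialized) word embeddings.
vocab_size, embed_dim, hidden_dim = tokenizer.vocab_size, 300, 1024
embedding = nn.Embedding(vocab_size, embed_dim)
gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
gru_feats, _ = gru(embedding(inputs["input_ids"]))    # (1, n_tokens, 2 * hidden_dim)
```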
“…Rather than training on single image-text pairs, CAMERA [28] jointly trains an image with its multiple (multi-view) descriptions, and selects content information through an attention module that exploits both intra-modal and inter-modal interactions. Although CAMERA also uses a contrastive loss similar to previous works, it introduces a diversity regularization term that distinguishes its loss.…”
Section: Pretrained Models
confidence: 99%
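The statement above contrasts CAMERA's objective with earlier contrastive losses through its diversity regularization term. The sketch below is a generic hinge-based bidirectional ranking loss plus an orthogonality-style diversity penalty on multi-view attention maps; it illustrates the idea only and is not CAMERA's exact formulation, and all names and hyperparameters are assumptions.

```python
import torch

def ranking_loss_with_diversity(img_emb, txt_emb, attn, margin=0.2, lam=0.1):
    """Hinge-based bidirectional ranking loss plus a diversity penalty that
    discourages the multi-view attention maps of each sample from collapsing
    onto the same content (||A A^T - I||_F^2 style). Illustrative only.
    img_emb, txt_emb: (batch, d) L2-normalized embeddings of matched pairs.
    attn: (batch, n_views, n_regions) multi-view attention weights."""
    sim = img_emb @ txt_emb.t()                          # (batch, batch) similarities
    pos = sim.diag().unsqueeze(1)                        # matched-pair similarities
    cost_i2t = (margin + sim - pos).clamp(min=0)         # image-to-text hinge costs
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)     # text-to-image hinge costs
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    rank = cost_i2t.masked_fill(mask, 0).sum() + cost_t2i.masked_fill(mask, 0).sum()
    # Diversity term: attention rows of each sample should be near-orthogonal.
    eye = torch.eye(attn.size(1))
    div = ((attn @ attn.transpose(1, 2)) - eye).pow(2).sum(dim=(1, 2)).mean()
    return rank + lam * div
```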
“…the given image (visual question answering [2,9,13]), generating multimodal responses based on user intentions (multimodal task-oriented dialog [25]), or describing what they see with a natural sentence (image captioning [1,6,42,43,45,46]). With the development of deep learning techniques, there has been a steady momentum of breakthroughs that push the limits of vision-language tasks [32,44]. Despite having promising quantitative results, the achievements rely heavily on the requirement of large quantities of task-specific annotations (e.g., image-question-answer triplets/image-sentence pairs) for such neural model learning.…”
Section: Introduction
confidence: 99%