2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00586
CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval

Abstract: Text-image cross-modal retrieval is a challenging task in the field of language and vision. Most previous approaches independently embed images and sentences into a joint embedding space and compare their similarities. However, previous approaches rarely explore the interactions between images and sentences before calculating similarities in the joint space. Intuitively, when matching between images and sentences, human beings would alternatively attend to regions in images and words in sentences, and select t…

Cited by 254 publications (149 citation statements). References 36 publications (85 reference statements).
“…According to the granularity of representation, studies on image-text matching can be categorized into two groups: 1) global embedding based methods [5,6,34,44], and 2) local inference based methods [3,16,18,21,26,37]. The former first embed whole images and sentences into a joint embedding space and then calculate the visual-semantic similarity.…”
Section: Related Work 2.1 Image-Text Matching
confidence: 99%
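The global-embedding approach this statement describes can be sketched in a few lines: pool each modality to a single vector in a shared space and rank by cosine similarity. This is a minimal illustrative sketch, not any cited paper's exact model; the function names and the assumption that features are already projected into the joint space are mine.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_similarity(image_emb, text_emb):
    """Cosine similarity between globally pooled image and sentence embeddings.

    image_emb: (n_images, d) image features assumed already projected into the joint space.
    text_emb:  (n_texts, d) sentence features in the same space.
    Returns an (n_images, n_texts) matrix of visual-semantic similarities used to rank retrievals.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return img @ txt.T
```

Retrieval then reduces to sorting each row (image-to-text) or column (text-to-image) of this matrix.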
“…For instance, Lee et al. [16] presented a cross-attention mechanism that aligns image regions and words to infer image-text similarity. Wang et al. [37] proposed a cross-modal adaptive message passing method to perform fine-grained interaction and filter out irrelevant information with a gating strategy. Liu et al. [21] designed a focal attention network comprising pre-assigning and re-assigning attention, which focuses on eliminating irrelevant fragments.…”
Section: Related Work 2.1 Image-Text Matching
confidence: 99%
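The gated message-passing idea attributed to Wang et al. [37] can be illustrated with a minimal NumPy sketch: each image region attends over the words of a sentence, aggregates a message, and a sigmoid gate modulates how much of that message is fused back, suppressing messages from irrelevant fragments. Shapes, the scaled dot-product attention, and the gating form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_modal_message(regions, words):
    """One gated cross-modal message-passing step (illustrative sketch).

    regions: (n_regions, d) image-region features.
    words:   (n_words, d) word features.
    Each region attends over the words, aggregates a message, and a
    per-region sigmoid gate decides how much of the message to fuse in.
    """
    # Scaled dot-product attention from regions to words.
    attn = softmax(regions @ words.T / np.sqrt(regions.shape[1]))  # (n_regions, n_words)
    message = attn @ words                                         # (n_regions, d)
    # Gate in [0, 1] from region-message agreement; near 0 blocks the message.
    gate = sigmoid((regions * message).sum(axis=1, keepdims=True))  # (n_regions, 1)
    return regions + gate * message
```

A full model would learn projection and gating parameters end-to-end; the point here is only the structure: attend, aggregate, gate, fuse.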