2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00750

Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models

Abstract: Textual-visual cross-modal retrieval has been a hot research topic in both computer vision and natural language processing communities. Learning appropriate representations for multi-modal data is crucial for the cross-modal retrieval performance. Unlike existing image-text retrieval approaches that embed image-text pairs as single feature vectors in a common representational space, we propose to incorporate generative processes into the cross-modal feature embedding, through which we are able to learn not only the global abstract features but also the local grounded features.
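To make the abstract's idea concrete, here is a minimal sketch of combining a joint-embedding objective with an auxiliary image-to-caption generative objective. This is an illustrative simplification, not the paper's actual GXN architecture: the module names, dimensions, and the GRU decoder are all assumptions.

```python
# Sketch only: an assumed joint-embedding model with an auxiliary
# generative (image-to-caption) objective, loosely in the spirit of
# the abstract. Not the paper's actual GXN architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingSketch(nn.Module):
    def __init__(self, img_dim=2048, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)     # image -> joint space
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.txt_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Decoder that regenerates the caption from the image embedding;
        # this supplies the auxiliary generative objective.
        self.dec_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.dec_out = nn.Linear(embed_dim, vocab_size)

    def embed(self, images, captions):
        """Map images (B, img_dim) and captions (B, T) into the joint space."""
        v = F.normalize(self.img_proj(images), dim=-1)
        _, h = self.txt_rnn(self.word_emb(captions))
        t = F.normalize(h[-1], dim=-1)
        return v, t

    def generative_loss(self, images, captions):
        """Teacher-forced caption reconstruction conditioned on the image."""
        h0 = torch.tanh(self.img_proj(images)).unsqueeze(0)  # (1, B, embed_dim)
        out, _ = self.dec_rnn(self.word_emb(captions[:, :-1]), h0)
        logits = self.dec_out(out)                           # (B, T-1, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               captions[:, 1:].reshape(-1))
```

The total training loss would then be a weighted sum of a ranking loss over the (v, t) embeddings and this reconstruction term, so the embedding is shaped by both discrimination and generation.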

Cited by 368 publications (209 citation statements)
References 25 publications (41 reference statements)
“…Faghri et al [5] focus more on hard negatives and obtain good improvement using a triplet loss. Gu et al [8] further improve the learning of cross-view feature embedding by incorporating generative objectives. Our work also belongs to this direction of learning joint space for image and sentence with an emphasis on improving image representations.…”
Section: Related Work
confidence: 99%
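For context, the "hard negatives" idea from Faghri et al. [5] is a max-of-hinges ranking loss: within a mini-batch, each positive pair is contrasted only against its hardest negative. A minimal sketch, assuming L2-normalized embeddings and an assumed margin of 0.2:

```python
import torch

def hard_negative_triplet_loss(v, t, margin=0.2):
    """VSE++-style ranking loss: for each positive pair, penalize only the
    hardest in-batch negative. v, t: (B, D) L2-normalized embeddings where
    row i of v matches row i of t. margin=0.2 is an assumed value."""
    scores = v @ t.T                       # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)       # similarity of matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    # Hinge costs against all negatives, with the positives masked out.
    cost_t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    cost_v = (margin + scores - pos.T).clamp(min=0).masked_fill(mask, 0)
    # Keep only the hardest negative per image / per caption ("max of hinges").
    return cost_t.max(dim=1)[0].mean() + cost_v.max(dim=0)[0].mean()
```

Summing over all negatives instead of taking the max recovers the older sum-of-hinges loss; the max variant is what gave the improvement the quote refers to.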
“…[10] proposed a model to learn semantic concepts and order for better image and sentence matching. Gu et al [9] leveraged generative models to learn concrete grounded representations that capture the detailed similarity between the two modalities. Lee et al [16] proposed stacked cross attention to exploit the correspondences between words and regions for discovering full latent alignments.…”
Section: Cross-modal Gated Fusion
confidence: 99%
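At a high level, the stacked cross attention of Lee et al. [16] lets each word attend over image regions and scores the sentence against its attended visual context. The sketch below is a simplified version of that idea, not the exact SCAN formulation; the inverse temperature lambda_ = 9.0 is an assumed value:

```python
import torch
import torch.nn.functional as F

def word_region_similarity(words, regions, lambda_=9.0):
    """Simplified word-to-region cross attention in the spirit of stacked
    cross attention. words: (Tw, D) word features, regions: (Tr, D) region
    features; lambda_ is an assumed inverse temperature. Returns a scalar
    image-sentence similarity."""
    w = F.normalize(words, dim=-1)
    r = F.normalize(regions, dim=-1)
    sim = w @ r.T                              # (Tw, Tr) word-region cosines
    attn = F.softmax(lambda_ * sim, dim=1)     # each word attends over regions
    attended = attn @ regions                  # (Tw, D) visual context per word
    # Compare each word with its attended region context, then pool.
    per_word = F.cosine_similarity(words, attended, dim=-1)
    return per_word.mean()
```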
“…Results on COCO (Caption Retrieval / Image Retrieval):

Method       Caption Retrieval        Image Retrieval
             R@1    R@5    R@10       R@1    R@5    R@10
Order [27]   46.7   -      88.9       37.9   -      85.9
DPC [34]     65.6   89.8   95.5       47.1   79.9   90.0
VSE++ [5]    64.6   -      95.7       52.0   -      92.0
GXN [9]      68.5   -      97.9       56.6   -      94.5
SCO [10]     69.9   92.9   97.5       56.7   87.5   94.8
CMPM [33]    56.

Table 1 presents our results compared with previous methods on 5k test images and 5 folds of 1k test images of COCO dataset, respectively.…”
Section: Coco 1k Test Images Caption Retrieval
confidence: 99%
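The R@1/R@5/R@10 columns above are Recall@K: the percentage of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch, assuming the ground truth for query i is item i of the similarity matrix:

```python
import numpy as np

def recall_at_k(scores, ks=(1, 5, 10)):
    """scores: (num_queries, num_items) similarity matrix, where the ground
    truth for query i is assumed to be item i. Returns {K: Recall@K in %}."""
    order = np.argsort(-scores, axis=1)       # items ranked best-first per query
    gt = np.arange(scores.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)    # rank position of the true item
    return {k: 100.0 * np.mean(ranks < k) for k in ks}
```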
“…In this section, selected applications for multimodal intelligence that combine vision and language are discussed, which include image captioning, text-to-image generation, and VQA. It is worth noting that there are other applications, such as text-based image retrieval [94], [164], [165], and visual-and-language navigation (VLN) [166]-[174], that we have not included in this paper due to space limitation.…”
Section: Applications
confidence: 99%