2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00750

Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models

Abstract: Textual-visual cross-modal retrieval has been a hot research topic in both computer vision and natural language processing communities. Learning appropriate representations for multi-modal data is crucial for the cross-modal retrieval performance. Unlike existing image-text retrieval approaches that embed image-text pairs as single feature vectors in a common representational space, we propose to incorporate generative processes into the cross-modal feature embedding, through which we are able to learn not only the global abstract features but also the local grounded features.
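To make the abstract's idea concrete, here is a minimal sketch of combining a joint-embedding objective with an auxiliary image-to-caption generative objective. This is an illustrative simplification, not the paper's actual GXN architecture: the module names, dimensions, and the GRU decoder are all assumptions.

```python
# Sketch only: an assumed joint-embedding model with an auxiliary
# generative (image-to-caption) objective, loosely in the spirit of
# the abstract. Not the paper's actual GXN architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingSketch(nn.Module):
    def __init__(self, img_dim=2048, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)     # image -> joint space
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.txt_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Decoder that regenerates the caption from the image embedding;
        # this supplies the auxiliary generative objective.
        self.dec_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.dec_out = nn.Linear(embed_dim, vocab_size)

    def embed(self, images, captions):
        """Map images (B, img_dim) and captions (B, T) into the joint space."""
        v = F.normalize(self.img_proj(images), dim=-1)
        _, h = self.txt_rnn(self.word_emb(captions))
        t = F.normalize(h[-1], dim=-1)
        return v, t

    def generative_loss(self, images, captions):
        """Teacher-forced caption reconstruction conditioned on the image."""
        h0 = torch.tanh(self.img_proj(images)).unsqueeze(0)  # (1, B, embed_dim)
        out, _ = self.dec_rnn(self.word_emb(captions[:, :-1]), h0)
        logits = self.dec_out(out)                           # (B, T-1, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               captions[:, 1:].reshape(-1))
```

The total training loss would then be a weighted sum of a ranking loss over the (v, t) embeddings and this reconstruction term, so the embedding is shaped by both discrimination and generation.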

Cited by 368 publications (209 citation statements)
References 25 publications (41 reference statements)
“…Faghri et al [5] focus more on hard negatives and obtain good improvement using a triplet loss. Gu et al [8] further improve the learning of cross-view feature embedding by incorporating generative objectives. Our work also belongs to this direction of learning joint space for image and sentence with an emphasis on improving image representations.…”
Section: Related Work
confidence: 99%
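For context, the "hard negatives" idea from Faghri et al. [5] is a max-of-hinges ranking loss: within a mini-batch, each positive pair is contrasted only against its hardest negative. A minimal sketch, assuming L2-normalized embeddings and an assumed margin of 0.2:

```python
import torch

def hard_negative_triplet_loss(v, t, margin=0.2):
    """VSE++-style ranking loss: for each positive pair, penalize only the
    hardest in-batch negative. v, t: (B, D) L2-normalized embeddings where
    row i of v matches row i of t. margin=0.2 is an assumed value."""
    scores = v @ t.T                       # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)       # similarity of matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    # Hinge costs against all negatives, with the positives masked out.
    cost_t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    cost_v = (margin + scores - pos.T).clamp(min=0).masked_fill(mask, 0)
    # Keep only the hardest negative per image / per caption ("max of hinges").
    return cost_t.max(dim=1)[0].mean() + cost_v.max(dim=0)[0].mean()
```

Summing over all negatives instead of taking the max recovers the older sum-of-hinges loss; the max variant is what gave the improvement the quote refers to.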
“…[10] proposed a model to learn semantic concepts and order for better image and sentence matching. Gu et al [9] leveraged generative models to learn concrete grounded representations that capture the detailed similarity between the two modalities. Lee et al [16] proposed stacked cross attention to exploit the correspondences between words and regions for discovering full latent alignments.…”
Section: Cross-modal Gated Fusion
confidence: 99%
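At a high level, the stacked cross attention of Lee et al. [16] lets each word attend over image regions and scores the sentence against its attended visual context. The sketch below is a simplified version of that idea, not the exact SCAN formulation; the inverse temperature lambda_ = 9.0 is an assumed value:

```python
import torch
import torch.nn.functional as F

def word_region_similarity(words, regions, lambda_=9.0):
    """Simplified word-to-region cross attention in the spirit of stacked
    cross attention. words: (Tw, D) word features, regions: (Tr, D) region
    features; lambda_ is an assumed inverse temperature. Returns a scalar
    image-sentence similarity."""
    w = F.normalize(words, dim=-1)
    r = F.normalize(regions, dim=-1)
    sim = w @ r.T                              # (Tw, Tr) word-region cosines
    attn = F.softmax(lambda_ * sim, dim=1)     # each word attends over regions
    attended = attn @ regions                  # (Tw, D) visual context per word
    # Compare each word with its attended region context, then pool.
    per_word = F.cosine_similarity(words, attended, dim=-1)
    return per_word.mean()
```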
“…Results on COCO (Caption Retrieval / Image Retrieval):

Method       Caption Retrieval        Image Retrieval
             R@1    R@5    R@10       R@1    R@5    R@10
Order [27]   46.7   -      88.9       37.9   -      85.9
DPC [34]     65.6   89.8   95.5       47.1   79.9   90.0
VSE++ [5]    64.6   -      95.7       52.0   -      92.0
GXN [9]      68.5   -      97.9       56.6   -      94.5
SCO [10]     69.9   92.9   97.5       56.7   87.5   94.8
CMPM [33]    56.

Table 1 presents our results compared with previous methods on 5k test images and 5 folds of 1k test images of COCO dataset, respectively.…”
Section: Coco 1k Test Images Caption Retrieval
confidence: 99%
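The R@1/R@5/R@10 columns above are Recall@K: the percentage of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch, assuming the ground truth for query i is item i of the similarity matrix:

```python
import numpy as np

def recall_at_k(scores, ks=(1, 5, 10)):
    """scores: (num_queries, num_items) similarity matrix, where the ground
    truth for query i is assumed to be item i. Returns {K: Recall@K in %}."""
    order = np.argsort(-scores, axis=1)       # items ranked best-first per query
    gt = np.arange(scores.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)    # rank position of the true item
    return {k: 100.0 * np.mean(ranks < k) for k in ks}
```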
“…In this section, selected applications for multimodal intelligence that combine vision and language are discussed, which include image captioning, text-to-image generation, and VQA. It is worth noting that there are other applications, such as text-based image retrieval [94], [164], [165], and visual-and-language navigation (VLN) [166]-[174], that we have not included in this paper due to space limitation.…”
Section: Applications
confidence: 99%