2020
DOI: 10.1145/3383184
Dual-path Convolutional Image-Text Embeddings with Instance Loss

Abstract: Matching images and sentences demands a fine understanding of both modalities. In this article, we propose a new system to discriminatively embed the image and text to a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image/text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because it is hard to find appropriate tr…
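For reference, here is a minimal PyTorch-style sketch of the margin-based bidirectional ranking loss the abstract refers to. The function name, the margin value, and the use of cosine similarity on L2-normalized embeddings are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a margin-based bidirectional ranking loss for image-text
# matching. Matched image/text pairs share the same row index in a batch.
import torch

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, dim) L2-normalized embeddings."""
    sim = img_emb @ txt_emb.t()            # (batch, batch) cosine similarities
    pos = sim.diag().view(-1, 1)           # similarity of the matched pairs
    # image-to-text: every non-matching caption is a negative
    cost_i2t = (margin + sim - pos).clamp(min=0)
    # text-to-image: every non-matching image is a negative
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)
    # zero out the diagonal (the positives themselves)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()
```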

Cited by 317 publications (121 citation statements)
References 77 publications (119 reference statements)
“…Similarly, Zheng et al. put forward a method that uses CNNs to learn both text and image features, applying a classification loss to pre-train discriminative features and then a ranking loss to train feature matching. The method achieved good person search performance [17]. Chen et al. proposed a text and image block matching method to capture local similarity [32].…”
Section: Related Work
confidence: 99%
“…To deal with the challenge of cross-modality, we propose a cross-modal attention mechanism and a joint loss function. Most existing works extract the features of each modality independently and then measure the cross-modal correlation and similarity of the features [7, 17]. However, features from different modalities are quite different and noisy, and the correlation between them is weak because of cross-modal heterogeneity.…”
Section: Introduction
confidence: 99%
“…where Φ_cnn, Φ_lstm, and Φ_linear are the learnable parameters of the CNN, the LSTM, and the linear transformation, respectively. During learning, we follow (Faghri et al. 2017; Zheng et al. 2017) and adopt a two-step learning strategy to prevent over-fitting. In the first step, we fix the parameters of the pretrained residual CNN and optimize only the parameters of the LSTM and the linear transformation.…”
Section: Common Visual-semantic Embedding
confidence: 99%
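The two-step strategy quoted above is straightforward to express in code. Below is a hedged sketch, assuming a PyTorch setup with a ResNet-50 image CNN, an LSTM text encoder, and a linear projection; the module names, dimensions, and learning rates are illustrative, not taken from the cited papers.

```python
# Sketch of the two-step strategy quoted above (hypothetical module names):
# stage 1 freezes the pretrained residual CNN and optimizes only the LSTM
# and the linear transformation; stage 2 unfreezes everything.
import torch
import torchvision

cnn = torchvision.models.resnet50(weights="IMAGENET1K_V1")
lstm = torch.nn.LSTM(input_size=300, hidden_size=1024, batch_first=True)
linear = torch.nn.Linear(1024, 2048)

# Stage 1: fix Phi_cnn, optimize Phi_lstm and Phi_linear only.
for p in cnn.parameters():
    p.requires_grad = False
stage1_params = list(lstm.parameters()) + list(linear.parameters())
opt1 = torch.optim.SGD(stage1_params, lr=1e-2, momentum=0.9)

# ... train for several epochs with opt1 ...

# Stage 2: unfreeze the CNN and fine-tune end to end at a lower rate.
for p in cnn.parameters():
    p.requires_grad = True
opt2 = torch.optim.SGD(
    [{"params": cnn.parameters(), "lr": 1e-3},
     {"params": stage1_params, "lr": 1e-2}],
    momentum=0.9,
)
```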
“…Zheng et al. (Zheng et al. 2017) proposed the instance loss for instance-level image-text matching. Based on the assumption that each image/text group is distinct, they viewed each image/text group as its own class.…”
Section: Unsupervised Training With Instance Loss
confidence: 99%
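A minimal sketch of the instance loss idea as described in this statement: every image/text group gets its own class ID, and a classifier shared by both modalities is trained to predict it. The classifier size, embedding dimension, and function names here are assumptions for illustration.

```python
# Sketch of the instance loss: each image/text pair (instance) is treated
# as its own class, and a classifier shared by both modalities predicts
# the instance ID. Dimensions are illustrative.
import torch
import torch.nn.functional as F

num_instances = 30000   # e.g., one class per training image/text group
embed_dim = 2048
shared_classifier = torch.nn.Linear(embed_dim, num_instances)

def instance_loss(img_emb, txt_emb, instance_ids):
    """img_emb, txt_emb: (batch, embed_dim); instance_ids: (batch,)
    labels, one distinct class ID per image/text group."""
    loss_img = F.cross_entropy(shared_classifier(img_emb), instance_ids)
    loss_txt = F.cross_entropy(shared_classifier(txt_emb), instance_ids)
    return loss_img + loss_txt
```

Sharing the classifier weights across modalities is what ties the two embedding paths to a common space; a separate classifier per modality would let them drift apart.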
“…a very limited training dataset containing about one-seventh of (Gordo et al. 2016), with only image-level labels. For the unsupervised mode, motivated by the instance loss for image-text matching (Zheng et al. 2017), we leverage it to train our multiple saliency block (MSB) without any supervised information. For the supervised mode, we use a standard classification loss and a triplet loss with batch-hard mining (Hermans, Beyer, and Leibe 2017) to fine-tune the whole network in two-stage learning with only image-level labels.…”
confidence: 99%
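For the batch-hard triplet loss cited in this statement (Hermans, Beyer, and Leibe 2017), a compact sketch follows: each anchor is paired with its farthest in-batch positive and its closest in-batch negative. Euclidean distance and the margin value are illustrative choices, not the cited paper's exact configuration.

```python
# Sketch of the batch-hard triplet loss: for each anchor, take the hardest
# positive and hardest negative within the mini-batch.
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """embeddings: (batch, dim); labels: (batch,) image-level class labels."""
    dist = torch.cdist(embeddings, embeddings)          # pairwise Euclidean
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (batch, batch) mask
    # hardest positive: farthest sample sharing the anchor's label
    pos_dist = dist.masked_fill(~same, float("-inf")).max(dim=1).values
    # hardest negative: closest sample with a different label
    neg_dist = dist.masked_fill(same, float("inf")).min(dim=1).values
    return torch.relu(pos_dist - neg_dist + margin).mean()
```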