2020
DOI: 10.1145/3383184
Dual-path Convolutional Image-Text Embeddings with Instance Loss

Abstract: Matching images and sentences demands a fine understanding of both modalities. In this article, we propose a new system to discriminatively embed the image and text to a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image/text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because it is hard to find appropriate tr…
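For reference, here is a minimal PyTorch-style sketch of the margin-based bidirectional ranking loss the abstract refers to. The function name, the margin value, and the use of cosine similarity on L2-normalized embeddings are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a margin-based bidirectional ranking loss for image-text
# matching. Matched image/text pairs share the same row index in a batch.
import torch

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, dim) L2-normalized embeddings."""
    sim = img_emb @ txt_emb.t()            # (batch, batch) cosine similarities
    pos = sim.diag().view(-1, 1)           # similarity of the matched pairs
    # image-to-text: every non-matching caption is a negative
    cost_i2t = (margin + sim - pos).clamp(min=0)
    # text-to-image: every non-matching image is a negative
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)
    # zero out the diagonal (the positives themselves)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()
```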

Cited by 317 publications (121 citation statements)
References 77 publications (119 reference statements)
“…Similarly, Zheng et al. put forward a method that uses CNNs to learn both text and image features, applying a classification loss to pre-train discriminative features and then a ranking loss to train feature matching. The method achieved good person search performance [17]. Chen et al. proposed a text and image block matching method to capture local similarity [32].…”
Section: Related Work
confidence: 99%
“…To deal with the challenge of cross-modality, we propose a cross-modal attention mechanism and a joint loss function. Most existing works extract the features of each modality independently and then measure the cross-modal correlation and similarity of the features [7, 17]. However, features from different modalities are quite different and noisy, and the correlation between them is weak because of cross-modal heterogeneity.…”
Section: Introduction
confidence: 99%
“…where Φ_cnn, Φ_lstm, and Φ_linear are the learnable parameters of the CNN, the LSTM, and the linear transformation, respectively. During learning, we follow (Faghri et al. 2017; Zheng et al. 2017) and adopt a two-step learning strategy to prevent over-fitting. In the first step, we fix the parameters of the pretrained residual CNN and optimize only the parameters of the LSTM and the linear transformation.…”
Section: Common Visual-semantic Embedding
confidence: 99%
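The two-step strategy quoted above is straightforward to express in code. Below is a hedged sketch, assuming a PyTorch setup with a ResNet-50 image CNN, an LSTM text encoder, and a linear projection; the module names, dimensions, and learning rates are illustrative, not taken from the cited papers.

```python
# Sketch of the two-step strategy quoted above (hypothetical module names):
# stage 1 freezes the pretrained residual CNN and optimizes only the LSTM
# and the linear transformation; stage 2 unfreezes everything.
import torch
import torchvision

cnn = torchvision.models.resnet50(weights="IMAGENET1K_V1")
lstm = torch.nn.LSTM(input_size=300, hidden_size=1024, batch_first=True)
linear = torch.nn.Linear(1024, 2048)

# Stage 1: fix Phi_cnn, optimize Phi_lstm and Phi_linear only.
for p in cnn.parameters():
    p.requires_grad = False
stage1_params = list(lstm.parameters()) + list(linear.parameters())
opt1 = torch.optim.SGD(stage1_params, lr=1e-2, momentum=0.9)

# ... train for several epochs with opt1 ...

# Stage 2: unfreeze the CNN and fine-tune end to end at a lower rate.
for p in cnn.parameters():
    p.requires_grad = True
opt2 = torch.optim.SGD(
    [{"params": cnn.parameters(), "lr": 1e-3},
     {"params": stage1_params, "lr": 1e-2}],
    momentum=0.9,
)
```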
“…Zheng et al. (Zheng et al. 2017) proposed the instance loss for instance-level image-text matching. Based on the assumption that each image/text group is distinct, they viewed each image/text group as its own class.…”
Section: Unsupervised Training With Instance Loss
confidence: 99%
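A minimal sketch of the instance loss idea as described in this statement: every image/text group gets its own class ID, and a classifier shared by both modalities is trained to predict it. The classifier size, embedding dimension, and function names here are assumptions for illustration.

```python
# Sketch of the instance loss: each image/text pair (instance) is treated
# as its own class, and a classifier shared by both modalities predicts
# the instance ID. Dimensions are illustrative.
import torch
import torch.nn.functional as F

num_instances = 30000   # e.g., one class per training image/text group
embed_dim = 2048
shared_classifier = torch.nn.Linear(embed_dim, num_instances)

def instance_loss(img_emb, txt_emb, instance_ids):
    """img_emb, txt_emb: (batch, embed_dim); instance_ids: (batch,)
    labels, one distinct class ID per image/text group."""
    loss_img = F.cross_entropy(shared_classifier(img_emb), instance_ids)
    loss_txt = F.cross_entropy(shared_classifier(txt_emb), instance_ids)
    return loss_img + loss_txt
```

Sharing the classifier weights across modalities is what ties the two embedding paths to a common space; a separate classifier per modality would let them drift apart.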
“…a very limited training dataset containing about one-seventh of (Gordo et al. 2016), with only image-level labels. For the unsupervised mode, motivated by the instance loss for image-text matching (Zheng et al. 2017), we leverage it to train our multiple saliency block (MSB) without any supervised information. For the supervised mode, we use a standard classification loss and a triplet loss with batch-hard mining (Hermans, Beyer, and Leibe 2017) to fine-tune the whole network in two-stage learning with only image-level labels.…”
confidence: 99%
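For the batch-hard triplet loss cited in this statement (Hermans, Beyer, and Leibe 2017), a compact sketch follows: each anchor is paired with its farthest in-batch positive and its closest in-batch negative. Euclidean distance and the margin value are illustrative choices, not the cited paper's exact configuration.

```python
# Sketch of the batch-hard triplet loss: for each anchor, take the hardest
# positive and hardest negative within the mini-batch.
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """embeddings: (batch, dim); labels: (batch,) image-level class labels."""
    dist = torch.cdist(embeddings, embeddings)          # pairwise Euclidean
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (batch, batch) mask
    # hardest positive: farthest sample sharing the anchor's label
    pos_dist = dist.masked_fill(~same, float("-inf")).max(dim=1).values
    # hardest negative: closest sample with a different label
    neg_dist = dist.masked_fill(same, float("inf")).min(dim=1).values
    return torch.relu(pos_dist - neg_dist + margin).mean()
```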