2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00446
Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval

Abstract: Thanks to the success of deep learning, cross-modal retrieval has made significant progress recently. However, there still remains a crucial bottleneck: how to bridge the modality gap to further enhance the retrieval accuracy. In this paper, we propose a self-supervised adversarial hashing (SSAH) approach, which lies among the early attempts to incorporate adversarial learning into cross-modal hashing in a self-supervised fashion. The primary contribution of this work is that two adversarial networks are lever…
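The abstract above describes learning binary hash codes so that image and text features can be compared across modalities. Below is a minimal sketch of the retrieval step common to cross-modal hashing methods of this kind: real-valued features are binarized with the sign function, and database items are ranked by Hamming distance to the query code. The random features and the 16-bit code length are illustrative stand-ins, not the paper's actual SSAH networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuous features for two modalities (e.g. image database, text query).
# In a real hashing network these would come from learned encoders.
img_feats = rng.standard_normal((5, 16))   # 5 database images, 16-dim features
txt_query = rng.standard_normal((1, 16))   # 1 text query

def binarize(x):
    """Map real-valued features to +/-1 hash codes via the sign function."""
    return np.where(x >= 0, 1, -1)

def hamming_distance(codes, query):
    """Row-wise Hamming distance between +/-1 code matrices.

    For +/-1 codes of length `bits`, d_H = (bits - dot(a, b)) / 2.
    """
    bits = codes.shape[1]
    return (bits - codes @ query.T) // 2

db_codes = binarize(img_feats)   # shape (5, 16), entries in {-1, +1}
q_code = binarize(txt_query)     # shape (1, 16)

dists = hamming_distance(db_codes, q_code).ravel()
ranking = np.argsort(dists)      # nearest database items first
```

Because both modalities are mapped into the same Hamming space, the same distance works regardless of which modality issued the query; this constant-time bitwise comparison is what makes hashing attractive for large-scale retrieval.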

Cited by 378 publications (220 citation statements)
References 42 publications (66 reference statements)
“…The performance of DistillHash first increases and then keeps at a relatively high level. The result is also not sensitive to p in the range of [32, 128]. For other experiments in this paper, we select p as 48.…”
Section: Parameter Sensitivity
confidence: 92%
“…On the other hand, supervised hashing methods [38], [39], [40], [41], [42] take full advantage of the label information to mitigate the semantic gap and improve the hashing quality, therefore attaining higher search accuracy than the unsupervised methods. In semantic correlation maximization hashing (SCMH) [39], semantic labels are merged into the hash learning procedure for large-scale data modeling.…”
Section: Cross-Modal Hashing
confidence: 99%
“…Nagrani et al. [37] demonstrated that a joint representation can be learned from facial and voice information and introduced a curriculum learning strategy [3,45,46] to perform hard negative mining during training. Text-to-image matching is a well-studied problem in computer vision [4,19,26,38,49,51,59,60,66,68], facilitated by datasets describing objects, birds, or flowers [29,44,67]. A relatively new application of text-to-image matching is person search, the task of which is to retrieve the most relevant frames of an individual given a textual description as an input.…”
Section: Related Work
confidence: 99%