Cross-modal retrieval has drawn wide interest for retrieval across different modalities of data (such as text, image, video, audio and 3D model). However, existing methods based on deep neural network (DNN) often face the challenge of insufficient cross-modal training data, which limits the training effectiveness and easily leads to overfitting. Transfer learning is usually adopted for relieving the problem of insufficient training data, but it mainly focuses on knowledge transfer only from large-scale datasets as single-modal source domain (such as ImageNet) to single-modal target domain. In fact, such large-scale single-modal datasets also contain rich modal-independent semantic knowledge that can be shared across different modalities. Besides, large-scale cross-modal datasets are very labor-consuming to collect and label, so it is significant to fully exploit the knowledge in singlemodal datasets for boosting cross-modal retrieval. To achieve this goal, this paper proposes modal-adversarial hybrid transfer network (MHTN), which to the best of our knowledge is the first work to realize knowledge transfer from single-modal source domain to cross-modal target domain, and learn cross-modal common representation. It is an end-to-end architecture with two subnetworks: (1) Modal-sharing knowledge transfer subnetwork is proposed to jointly transfer knowledge from a large-scale singlemodal dataset in source domain to all modalities in target domain with a star network structure, which distills modal-independent supplementary knowledge for promoting cross-modal common representation learning. (2) Modal-adversarial semantic learning subnetwork is proposed to construct an adversarial training mechanism between common representation generator and modality discriminator, making the common representation discriminative for semantics but indiscriminative for modalities to enhance crossmodal semantic consistency during transfer process. Comprehensive experiments on 4 widely-used datasets show its effectiveness and generality.
Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space, which can realize fast and flexible retrieval across different modalities. Supervised cross-modal hashing methods have achieved considerable progress by incorporating semantic side information. However, they mainly have two limitations: (1) Heavily rely on large-scale labeled cross-modal training data which are labor intensive and hard to obtain, since multiple modalities are involved.(2) Ignore the rich information contained in the large amount of unlabeled data across different modalities, especially the margin examples from another modality that are easily to be incorrectly retrieved, which can help to model the correlations between different modalities. To address these problems, in this paper we propose a novel Semi-supervised Cross-Modal Hashing approach by Generative Adversarial Network (SCH-GAN). We aim to take advantage of GAN's ability for modeling data distributions, so that SCH-GAN can model the distribution across different modalities, and select informative margin examples from large amount of unlabeled data to promote cross-modal hashing learning in an adversarial way. The main contributions can be summarized as follows: (1) We propose a novel generative adversarial network for cross-modal hashing. In our proposed SCH-GAN, the generative model tries to select margin examples of one modality from unlabeled data when giving a query of another modality (e.g. giving a text query to retrieve images and vice versa). While the discriminative model tries to distinguish the selected examples and true positive examples of the query. These two models play a minimax game so that the generative model can promote the hashing performance of discriminative model.(2) We propose a reinforcement learning based algorithm to drive the training of proposed SCH-GAN. The generative model takes the correlation score predicted by discriminative model as a reward, and tries to select the examples close to the margin to promote discriminative model by maximizing the margin between positive and negative data. Extensive experiments compared with 8 state-of-the-art methods on 3 widely-used datasets verify the effectiveness of our proposed approach.
DNN-based cross-modal retrieval is a research hotspot to retrieve across different modalities as image and text, but existing methods often face the challenge of insufficient cross-modal training data. In single-modal scenario, similar problem is usually relieved by transferring knowledge from largescale auxiliary datasets (as ImageNet). Knowledge from such single-modal datasets is also very useful for cross-modal retrieval, which can provide rich general semantic information that can be shared across different modalities. However, it is challenging to transfer useful knowledge from single-modal (as image) source domain to cross-modal (as image/text) target domain. Knowledge in source domain cannot be directly transferred to both two different modalities in target domain, and the inherent cross-modal correlation contained in target domain provides key hints for cross-modal retrieval which should be preserved during transfer process. This paper proposes Cross-modal Hybrid Transfer Network (CHTN) with two subnetworks: Modalsharing transfer subnetwork utilizes the modality in both source and target domains as a bridge, for transferring knowledge to both two modalities simultaneously; Layer-sharing correlation subnetwork preserves the inherent cross-modal semantic correlation to further adapt to cross-modal retrieval task. Cross-modal data can be converted to common representation by CHTN for retrieval, and comprehensive experiments on 3 datasets show its effectiveness.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.