MHTN: Modal-Adversarial Hybrid Transfer Network for Cross-Modal Retrieval

Huang, Xin; Peng, Yuxin; Yuan, Mingkuan

doi:10.1109/tcyb.2018.2879846

Cited by 103 publications

(77 citation statements)

References 44 publications


“…
Method           I→T    I→A    I→V    T→I    T→A    T→V    A→I    A→T    A→V    V→I    V→T    V→A    Average
Our FGCrossNet   0.210  0.526  0.606  0.255  0.181  0.208  0.553  0.159  0.443  0.629  0.195  0.437  0.366
MHTN [20]        0.116  0.195  0.281  0.124  0.138  0.185  0.196  0.127  0.290  0.306  0.186  0.306  0.204
ACMR [21]        0.162  0.119  0.477  0.075  0.015  0.081  0.128  0.028  0.068  0.536  0.138  0.111  0.162
JRL [22]         0.160  0.085  0.435  0.190  0.028  0.095  0.115  0.035  0.065  0.517  0.126  0.068  0.160
GSPH [23]        0.140  0.098  0.413  0.179  0.024  0.109  0.129  0.024  0.073  0.512  0.126  0.086  0.159
CMDN [24]        0.099  0.009  0.377  0.123  0.007  0.078  0.017  0.008  0.010  0.446  0.081  0.009  0.105
SCAN [25]        0.050…”

Section: Methods (unclassified)

A New Benchmark and Approach for Fine-grained Cross-media Retrieval

Peng

Xie

2019

Proceedings of the 27th ACM International Conference on Multimedia

Self Cite


Cross-media retrieval is to return the results of various media types corresponding to the query of any media type. Existing researches generally focus on coarse-grained cross-media retrieval. When users submit an image of "Slaty-backed Gull" as a query, coarse-grained cross-media retrieval treats it as "Bird", so that users can only get the results of "Bird", which may include other bird species with similar appearance (image and video), descriptions (text) or sounds (audio), such as "Herring Gull". Such coarse-grained cross-media retrieval is not consistent with human lifestyle, where we generally have the fine-grained requirement of returning the exactly relevant results of "Slaty-backed Gull" instead of "Herring Gull". However, few researches focus on fine-grained cross-media retrieval, which is a highly challenging and practical task. Therefore, in this paper, we first construct a new benchmark for fine-grained cross-media retrieval, which consists of 200 fine-grained subcategories of the "Bird", and contains 4 media types, including image, text, video and audio. To the best of our knowledge, it is the first benchmark with 4 media types for fine-grained cross-media retrieval. Then, we propose a uniform deep model, namely FGCrossNet, which simultaneously learns 4 types of media without discriminative treatments. We jointly consider three constraints for better common representation learning: classification constraint ensures the learning of discriminative features for fine-grained subcategories, center constraint ensures the compactness characteristic of the features of the same subcategory, and ranking constraint ensures the sparsity characteristic of the features of different subcategories. Extensive experiments verify the usefulness of the new benchmark and the effectiveness of our FGCrossNet. The new benchmark and the source code of FGCrossNet will be made available at https://github.com/PKU-ICST-MIPL/FGCrossNet_ACMMM2019.
CCS CONCEPTS: • Information systems → Multimedia and multimodal retrieval; • Computing methodologies → Artificial intelligence.



“…On the other hand, the cross-modal hashing methods mainly focus on the retrieval efficiency by mapping the items of different modalities into a common binary Hamming space. Benefited from the strong ability of distribution modeling and discriminative representation learning, some recent cross-modal retrieval methods have collaborated with GAN models [9,10,2]. In this work, our method also follows the similar adversarial learning framework that uses the single-modal similarities to guide the cross-modal representation learning.…”

Section: Related Work (mentioning)

confidence: 99%

“…The ACMR [2] method proposes the triplet loss and the modality classifier for preserving the modality level semantic structures. The MHTN [10] is proposed to minimize the maximum mean discrepancy between modalities, which preserves more flexibility for the generator to project vectors into a new space. The difference between CMST and the previous work is that CMST can learn the item-level semantic relationships between unpaired items in an unsupervised way.…”

Section: Related Work (mentioning)

confidence: 99%
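
The statement above says that MHTN minimizes the maximum mean discrepancy (MMD) between modalities. As a rough illustration of that quantity only, the sketch below computes the standard biased squared-MMD estimate with an RBF kernel between two sets of feature vectors. It is not taken from the MHTN paper or its code: the function names `rbf_kernel` and `mmd2`, the kernel choice, and the default bandwidth `gamma = 1 / dim` are all assumptions made for this example.

```python
import numpy as np


def rbf_kernel(x, y, gamma):
    """RBF kernel matrix k(x_i, y_j) = exp(-gamma * ||x_i - y_j||^2)."""
    sq_dist = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2(x, y, gamma=None):
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed default bandwidth, not from the source
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_feats = rng.normal(size=(200, 8))            # stand-in for one modality
    text_feats_close = rng.normal(size=(200, 8))       # same distribution
    text_feats_far = rng.normal(loc=2.0, size=(200, 8))  # shifted distribution
    print("MMD^2, same distribution:   ", mmd2(image_feats, text_feats_close))
    print("MMD^2, shifted distribution:", mmd2(image_feats, text_feats_far))
```

In a training setting, a term of this kind would typically be added to the loss so that features from different modalities are pulled toward the same distribution in the common space; how MHTN weights or structures that term is not stated in the excerpt.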

Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities

Wen

Han

Yin

et al. 2019

2019 IEEE International Conference on Multimedia and Expo (ICME)


Cross-modal retrieval aims to retrieve relevant data across different modalities (e.g., texts vs. images). The common strategy is to apply element-wise constraints between manually labeled pair-wise items to guide the generators to learn the semantic relationships between the modalities, so that the similar items can be projected close to each other in the common representation subspace. However, such constraints often fail to preserve the semantic structure between unpaired but semantically similar items (e.g. the unpaired items with the same class label are more similar than items with different labels). To address the above problem, we propose a novel cross-modal similarity transferring (CMST) method to learn and preserve the semantic relationships between unpaired items in an unsupervised way. The key idea is to learn the quantitative similarities in single-modal representation subspace, and then transfer them to the common representation subspace to establish the semantic relationships between unpaired items across modalities. Experiments show that our method outperforms the state-of-the-art approaches both in the class-based and pair-based retrieval tasks.


“…Until now, these embeddings have been learned in a static manner, i.e. without preserving the time dimension, and thus ignoring the evolution of modality interactions [5,7,12,20,21,23,27,30,31,35,36]. Approaches have ranged from solutions that organize the space according to linear correlations [23,33,36] (image and texts cooccurrence), semantic [20,30,34,35] (category information) and/or temporal correlations [27].…”

Section: Introduction (mentioning)

confidence: 99%

Diachronic Cross-modal Embeddings

Semedo

Magalhães

2019

Proceedings of the 27th ACM International Conference on Multimedia


Understanding the semantic shifts of multimodal information is only possible with models that capture cross-modal interactions over time. Under this paradigm, a new embedding is needed that structures visual-textual interactions according to the temporal dimension, thus preserving the data's original temporal organisation. This paper introduces a novel diachronic cross-modal embedding (DCM), where cross-modal correlations are represented in embedding space, throughout the temporal dimension, preserving semantic similarity at each instant t. To achieve this, we trained a neural cross-modal architecture, under a novel ranking loss strategy, that for each multimodal instance enforces neighbour instances' temporal alignment, through subspace structuring constraints based on a temporal alignment window. Experimental results show that our DCM embedding successfully organises instances over time. Quantitative experiments confirm that DCM is able to preserve semantic cross-modal correlations at each instant t while also providing better alignment capabilities. Qualitative experiments unveil new ways to browse multimodal content and hint that multimodal understanding tasks can benefit from this new embedding.

