2010
DOI: 10.1504/ijmis.2010.035970

Cross-media retrieval: state-of-the-art and open issues

Cited by 18 publications (6 citation statements)
References 52 publications
“…Among these cross-modal techniques, cross-modal subspace learning methods have achieved state-of-the-art results in recent years [24,40,43,51,52], which have borrowed much inspiration from the conventional subspace approaches [53,54,55,56,57,58,59]. For a comprehensive survey, please refer to [60,61].…”
Section: Introduction (mentioning)
confidence: 99%
“…where X^(1) and X^(2) represent two modalities of data, and V represents the latent semantic representations. P^(1) and P^(2) are the learned projections.…”
Section: Linear Modeling (mentioning)
confidence: 99%
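The quoted statement describes the standard linear model for cross-modal subspace learning: each modality is mapped by its own projection into a shared latent space. Below is a minimal numpy sketch of one common formulation of that idea, minimizing sum_k ||P^(k) X^(k) - V||_F^2 by alternating ridge regressions; the objective, shapes, and names are illustrative assumptions, not the cited paper's exact method.

# Alternating least-squares sketch: two modalities X1, X2 are tied to a
# shared latent representation V through learned projections P1, P2.
# Assumed objective: sum_k || P_k @ X_k - V ||_F^2 with a small ridge term.
import numpy as np

def linear_cross_modal(X1, X2, dim, n_iter=50, ridge=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n = X1.shape[1]                      # number of paired samples
    V = rng.standard_normal((dim, n))    # shared latent representations
    for _ in range(n_iter):
        # Update each projection P_k by ridge regression so P_k @ X_k ~ V.
        P1 = V @ X1.T @ np.linalg.inv(X1 @ X1.T + ridge * np.eye(X1.shape[0]))
        P2 = V @ X2.T @ np.linalg.inv(X2 @ X2.T + ridge * np.eye(X2.shape[0]))
        # Update V as the average of the two projected modalities, then
        # renormalize its rows to avoid the trivial shrinking solution.
        V = 0.5 * (P1 @ X1 + P2 @ X2)
        V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    return P1, P2, V

# Usage: 100 paired samples, 64-d image features and 32-d text features.
X1 = np.random.default_rng(1).standard_normal((64, 100))
X2 = np.random.default_rng(2).standard_normal((32, 100))
P1, P2, V = linear_cross_modal(X1, X2, dim=10)

At retrieval time, a query from either modality would be projected with its own P^(k) and matched against projected items of the other modality by a similarity measure such as cosine distance.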
“…This paper aims to conduct a comprehensive survey of cross-modal retrieval. Although Liu et al. [1] gave an overview of cross-modal retrieval in 2010, it does not include many important works proposed in recent years. Xu et al. [2] summarize several methods for modeling multimodal data, but they focus on multi-view learning.…”
Section: Introduction (mentioning)
confidence: 99%
“…The common solution to understanding the relationship between image and text is to map the visual semantic embeddings [9,10] of an image and the corresponding words, phrases, and sentences into a common latent embedding space [9,11-16]. In these methods, the goal is generally to find a common space in which the corresponding representations of image-text pairs are as close as possible, hence making the recognition of their relationship easier.…”
Section: Introduction (mentioning)
confidence: 99%
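The statement above summarizes the common-space approach to image-text matching. The following numpy sketch illustrates that idea under assumed details: linear projections into a shared embedding space, cosine similarity scoring, and a bidirectional margin-based ranking loss over in-batch negatives. All names, shapes, and the specific loss are illustrative assumptions, not a method taken from the cited references.

# Common latent embedding space sketch: project image and text features,
# L2-normalize, and compute a hinge ranking loss that pulls matching
# image-text pairs together and pushes mismatched pairs apart.
import numpy as np

def embed(X, W):
    """Project features X (n, d) with W (d, k) and L2-normalize rows."""
    Z = X @ W
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)

def ranking_loss(img_feats, txt_feats, W_img, W_txt, margin=0.2):
    """Bidirectional hinge ranking loss over all in-batch negatives."""
    I = embed(img_feats, W_img)          # (n, k) image embeddings
    T = embed(txt_feats, W_txt)          # (n, k) text embeddings
    S = I @ T.T                          # (n, n) cosine similarities
    pos = np.diag(S)                     # matching pairs on the diagonal
    cost_i2t = np.maximum(0, margin + S - pos[:, None])  # image -> text
    cost_t2i = np.maximum(0, margin + S - pos[None, :])  # text -> image
    np.fill_diagonal(cost_i2t, 0)        # do not penalize the positives
    np.fill_diagonal(cost_t2i, 0)
    return cost_i2t.sum() + cost_t2i.sum()

# Usage: 8 paired samples, 64-d image / 32-d text features, 16-d space.
rng = np.random.default_rng(0)
loss = ranking_loss(rng.standard_normal((8, 64)), rng.standard_normal((8, 32)),
                    rng.standard_normal((64, 16)), rng.standard_normal((32, 16)))

Minimizing such a loss (typically with gradient descent in an autodiff framework) drives corresponding image-text representations close together in the shared space, which is exactly the property the quoted passage describes.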