2022
DOI: 10.1016/j.patcog.2021.108217
Multi-task framework based on feature separation and reconstruction for cross-modal retrieval

Cited by 15 publications (2 citation statements)
References 9 publications
“…For example, text-based retrieval methods may not be able to capture the rich visual content in images or videos. By integrating multiple modalities, cross-modal retrieval helps to represent information more comprehensively and accurately, thereby improving the overall retrieval performance [11]. Figure 1 shows a basic schematic diagram of cross-modal retrieval.…”
Section: Related Work, A. Cross-modal Retrieval (citation type: mentioning; confidence: 99%)
“…L_SH is also known as a negatives loss function, and it learns a fixed margin between the similarities of the relevant image-description embedding pairs and those of the irrelevant embedding pairs. A more recent loss function, the Max of Hinges loss (L_MH) [3], is adopted in most recent VSE networks due to its ability to outperform L_SH [20,21]. An improved version of L_SH, L_MH focuses only on learning from the hard negatives, which are the irrelevant image-description embedding pairs that lie nearest to the relevant image-description embedding pairs.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
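
To make the distinction in that statement concrete, here is a minimal PyTorch sketch of both losses. It is written for this summary, not taken from the indexed paper or from reference [3]; the function name hinge_losses, the margin value, and the use of cosine similarity over L2-normalized embeddings are illustrative assumptions. L_SH sums the hinge cost over every negative in the batch, while L_MH keeps only the hardest negative per query.

```python
# Illustrative sketch (not from the cited works): Sum of Hinges (L_SH)
# vs. Max of Hinges (L_MH) losses for image-description matching.
# Assumes embeddings are L2-normalized and that im[k], txt[k] form
# the relevant (positive) pair for each index k.
import torch

def hinge_losses(im, txt, margin=0.2):
    """im, txt: tensors of shape (batch, dim). Returns (L_SH, L_MH)."""
    scores = im @ txt.t()                # pairwise cosine similarities
    pos = scores.diag().view(-1, 1)      # similarities of relevant pairs

    # Hinge cost of each irrelevant pair, in both retrieval directions.
    cost_txt = (margin + scores - pos).clamp(min=0)      # image -> negative texts
    cost_im = (margin + scores - pos.t()).clamp(min=0)   # text -> negative images

    # Positives on the diagonal contribute no cost.
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)

    # L_SH: fixed margin enforced against all negatives in the batch.
    lsh = cost_txt.sum() + cost_im.sum()
    # L_MH: only the hardest (highest-scoring) negative per query counts.
    lmh = cost_txt.max(dim=1)[0].sum() + cost_im.max(dim=0)[0].sum()
    return lsh, lmh
```

Keeping only the hardest negative concentrates the gradient on the pairs the model currently confuses most, which is the behavior the quoted passage credits for L_MH outperforming L_SH.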