2022
DOI: 10.1145/3499027
Cross-modal Graph Matching Network for Image-text Retrieval

Abstract: Image-text retrieval is a fundamental cross-modal task whose main idea is to learn image-text matching. Generally, according to whether there exist interactions during the retrieval process, existing image-text retrieval methods can be classified into independent representation matching methods and cross-interaction matching methods. The independent representation matching methods generate the embeddings of images and sentences independently and thus are convenient for retrieval with hand-crafted matching meas…


Cited by 56 publications (25 citation statements)
References 38 publications
“…To validate the efficiency of our proposed FB-Net, we compare it with several state-of-the-art methods, in which seven non-DNN-based cross-modal retrieval methods (i.e., CCA [3], CMCP [23], JRL [25], JFSSL [26], and S²UPG [27]) and nine DNN-based methods (i.e., DCCA [7], CCL [12], SCAN [15], GXN [28], VSESC [29], MAVA [30], SGRAF [31], SCL [41], CGMN [42], NAAF [32], and VSRN++ [33]) are contained. Note that the comparison methods are implemented using the authors' public source codes and are enumerated as follows.…”
Section: Compared Methods
confidence: 99%
“…• CGMN [42] uses graph convolutional networks to investigate the intra-relation in images and sentences and accomplishes interrelation reasoning between regions and words without impacting search efficiency.…”
Section: Compared Methods
confidence: 99%
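The statement above describes CGMN's core mechanism: graph convolutions over an intra-modal graph of regions (or words) to propagate relational context before cross-modal matching. A minimal sketch of one such graph-convolution step is given below; the function name, the fully connected region graph, and the single mean-aggregation layer are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def gcn_layer(features, adjacency, weight):
    """One graph-convolution step: average neighbor features over the
    intra-modal graph, then apply a linear transform and ReLU."""
    # Row-normalize the adjacency so each node averages its neighbors.
    degree = adjacency.sum(axis=1, keepdims=True)
    norm_adj = adjacency / np.clip(degree, 1e-8, None)
    return np.maximum(norm_adj @ features @ weight, 0.0)

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))   # 5 image regions, 8-dim features
adj = np.ones((5, 5))               # hypothetical fully connected region graph
w = rng.normal(size=(8, 8))         # learnable projection (random here)
out = gcn_layer(regions, adj, w)    # relation-enhanced region features, (5, 8)
```

Because this reasoning happens within each modality before embeddings are compared, the enhanced features can be precomputed and indexed, which is why such intra-relation reasoning need not impact search efficiency.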
“…Baseline and comparative methods: Basic cross-modal initial retrieval methods [3], [6], [7], [8], [10], [11], [34], [35], [36], [37], [38], [39] were used as baseline methods. By comparing our method against these baselines, we confirm that our re-ranking method can improve the initial retrieval performance.…”
Section: Microsoft Common Objects in Context (MSCOCO) [33]
confidence: 99%
“…In addition to a single attention module, Song and Soleymani [34] utilized a multi-head self-attention network to exploit polysemous meanings. In addition, the graph convolutional network (GCN) has been employed in several methods to consider the relationship between local features, and these methods demonstrated good performance [35], [36].…”
Section: A. Cross-modal Retrieval
confidence: 99%
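The multi-head self-attention mentioned above lets each subspace ("head") attend to a different aspect of a word or region sequence, which is how polysemous meanings can be captured. A minimal numpy sketch follows; real models learn separate query/key/value projections per head, whereas this illustrative version uses identity projections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads):
    """Split features into heads, attend within each head, re-concatenate.
    Identity Q/K/V projections are a simplifying assumption."""
    n, d = x.shape
    assert d % num_heads == 0
    head_dim = d // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = x[:, h * head_dim:(h + 1) * head_dim]
        scores = softmax(q @ k.T / np.sqrt(head_dim))  # (n, n) attention
        heads.append(scores @ v)                       # weighted mix of tokens
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(1)
tokens = rng.normal(size=(4, 8))   # 4 word features, 8-dim each
attended = multi_head_self_attention(tokens, num_heads=2)  # (4, 8)
```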
“…For evaluating the effectiveness of our sentence-based semantic loss function, we introduce our loss to the training of recently proposed cross-modal image retrieval methods [31], [34], [35], [36]. We compared these cross-modal retrieval methods trained with our loss against their original versions.…”
Section: Implementation Details
confidence: 99%
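The training objective that such a semantic loss is typically added to is a hinge-based bidirectional ranking loss over an image-sentence similarity matrix. The sketch below shows that common baseline loss, not the citing paper's sentence-based semantic loss itself, which is not specified here.

```python
import numpy as np

def triplet_ranking_loss(sim, margin=0.2):
    """Bidirectional hinge ranking loss for image-text matching.
    sim[i, j] is the similarity of image i and sentence j; the diagonal
    holds the matched pairs. A generic stand-in, not the paper's loss."""
    n = sim.shape[0]
    pos = np.diag(sim).reshape(n, 1)
    cost_im = np.clip(margin + sim - pos, 0, None)    # sentence retrieval
    cost_s = np.clip(margin + sim - pos.T, 0, None)   # image retrieval
    mask = 1.0 - np.eye(n)                            # ignore matched pairs
    return float((cost_im * mask + cost_s * mask).sum())
```

With a perfectly separated similarity matrix (e.g. the identity), every negative pair violates no margin and the loss is zero; hard negatives close to the diagonal contribute positive cost.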