2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01267
IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

Cited by 266 publications (157 citation statements)
References 19 publications
“…For instance, BFAN is proposed to eliminate partially irrelevant words and regions from the shared semantics in image-text pairs, and it achieves state-of-the-art performance on several benchmark datasets. IMRAM (Chen et al., 2020) proposes a recurrent attention memory that incorporates a cross-modal attention unit and a memory distillation unit to refine the correspondence between image regions and text words. However, the attention mechanisms used by these many-to-many matching methods are usually complicated, with high computational complexity.…”
Section: Text-Image Matching
confidence: 99%
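
To make the cited mechanism concrete, below is a minimal PyTorch-style sketch of one iterative matching step: a cross-modal attention unit attends from image regions to text words, and a gated update stands in for the memory distillation unit. The tensor shapes, temperature, and gating form are illustrative assumptions, not IMRAM's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(query, context, temperature=9.0):
    """Attend from one modality (e.g., image regions, shape (n_q, d))
    to the other (e.g., text words, shape (n_c, d))."""
    q = F.normalize(query, dim=-1)
    c = F.normalize(context, dim=-1)
    attn = F.softmax(temperature * (q @ c.t()), dim=-1)  # (n_q, n_c)
    return attn @ context                                # (n_q, d)

def memory_distillation(query, attended, W):
    """Refine the query with attended features via a gated update
    (a hypothetical gate standing in for IMRAM's distillation unit).
    W: learnable weight of shape (2 * d, d)."""
    fused = torch.cat([query, attended], dim=-1)  # (n_q, 2d)
    gate = torch.sigmoid(fused @ W)               # (n_q, d)
    return gate * query + (1.0 - gate) * attended

def iterative_matching(regions, words, W, k_steps=3):
    """Repeat attention + distillation for k steps and sum the
    per-step similarities, mirroring the iterative refinement idea."""
    sims, q = [], regions
    for _ in range(k_steps):
        attended = cross_modal_attention(q, words)
        sims.append(F.cosine_similarity(q, attended, dim=-1).mean())
        q = memory_distillation(q, attended, W)
    return torch.stack(sims).sum()
```

Summing per-step similarities reflects the intuition that alignment between regions and words is progressively refined across iterations rather than computed once.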
“…Due to this advantage, a number of cross-modal hashing methods have been proposed (Su et al., 2019; Lin et al., 2020; Tu et al., 2020; Shi et al., 2019). For example, SDCH (Lin et al., 2020) utilizes a semantic label branch to preserve the semantic information of the learned features by integrating an inter-modal pairwise loss, a cross-entropy loss, and a quantization loss.…”
Section: Cross-Modal Hashing
confidence: 99%
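
To illustrate how such a multi-loss objective can be composed, the sketch below combines an inter-modal pairwise likelihood loss, a cross-entropy term on an assumed label-classification head (standing in for the semantic label branch), and a quantization penalty. The loss forms and weights are assumptions for illustration, not SDCH's published formulation.

```python
import torch
import torch.nn.functional as F

def hashing_objective(h_img, h_txt, labels, sim, classifier,
                      alpha=1.0, beta=1.0, gamma=0.1):
    """h_img, h_txt: real-valued codes, shape (batch, n_bits).
    labels: class indices, shape (batch,).
    sim: (batch, batch) matrix, 1 for matching image-text pairs, else 0.
    classifier: assumed head mapping n_bits -> n_classes."""
    # Inter-modal pairwise loss: negative log-likelihood pulling
    # matching image-text codes together, pushing mismatches apart.
    theta = 0.5 * (h_img @ h_txt.t())
    pairwise = (torch.log1p(torch.exp(theta)) - sim * theta).mean()

    # Cross-entropy loss on the semantic label branch.
    ce = F.cross_entropy(classifier(h_img), labels) + \
         F.cross_entropy(classifier(h_txt), labels)

    # Quantization loss: push real-valued codes toward binary {-1, +1}.
    quant = ((h_img.abs() - 1) ** 2).mean() + ((h_txt.abs() - 1) ** 2).mean()

    return alpha * pairwise + beta * ce + gamma * quant
```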
“…[3,7,10] targeted fine-grained alignment between images and texts via attention mechanisms or gated-fusion strategies. By stacking recurrent memory cells, [2,9] enhance visual representations and conduct semantic reasoning to preserve salient information. Wang et al. [8] concentrated on object similarity and relation similarity by virtue of a scene graph structure.…”
Section: Related Work
confidence: 99%
“…Image-text matching [1,2,3] is a representative task in cross-modal learning, aiming at retrieving the most relevant instances of one modality (e.g., image) given a query from another modality (e.g., text). The essence of image-text retrieval lies in evaluating the semantic similarity between the image and text modalities, which remains a challenge due to the heterogeneous gap between cross-modal data distributions.…”
Section: Introduction
confidence: 99%
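
A generic baseline for this retrieval setting simply ranks candidates of the other modality by cosine similarity in a shared embedding space. The sketch below shows such a baseline (embedding dimension and tensor names are illustrative); it is not any specific method's scoring function.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_embs, top_k=5):
    """Rank gallery items of the other modality by cosine similarity
    to the query embedding and return the top-k indices."""
    q = F.normalize(query_emb, dim=-1)      # (d,)
    g = F.normalize(gallery_embs, dim=-1)   # (n, d)
    scores = g @ q                          # (n,) cosine similarities
    return torch.topk(scores, k=top_k).indices

# Example: retrieve the 5 images most relevant to a text query.
text_emb = torch.randn(512)
image_embs = torch.randn(1000, 512)
top_images = retrieve(text_emb, image_embs, top_k=5)
```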