2021 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip42928.2021.9506438
Attend, Correct And Focus: A Bidirectional Correct Attention Network For Image-Text Matching

Abstract: The image-text matching task aims to learn fine-grained correspondences between images and sentences. Existing methods use attention mechanisms to learn these correspondences by attending to all fragments without considering the relationship between fragments and global semantics, which inevitably leads to semantic misalignment among irrelevant fragments. To this end, we propose a Bidirectional Correct Attention Network (BCAN), which leverages global similarities and local similarities to reassign the attention we…
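The abstract describes attending between image fragments (regions) and text fragments (words) and aggregating local similarities into an image-sentence score. The following is a minimal illustrative sketch of SCAN-style cross attention in that spirit — not the authors' BCAN implementation; the temperature value and mean aggregation are assumptions for illustration.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """L2-normalize vectors along an axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_attention_similarity(regions, words, temperature=9.0):
    """Illustrative cross-attention image-sentence similarity.

    regions: (R, d) image region features
    words:   (W, d) word features
    Returns a scalar similarity in [-1, 1].
    """
    r = l2norm(regions)
    w = l2norm(words)
    sim = w @ r.T                                 # (W, R) word-region cosines
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)       # each word attends over regions
    attended = attn @ regions                     # (W, d) region context per word
    local = np.sum(l2norm(attended) * w, axis=1)  # per-word local similarity
    return float(local.mean())                    # aggregate to a global score
```

BCAN's contribution, per the abstract, is to *correct* such attention weights using both global and local similarities so that irrelevant fragments are down-weighted; the sketch above shows only the uncorrected baseline attention step.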

Cited by 28 publications (57 citation statements) · References 42 publications
“…SCAN has been used as a baseline for many methods and has led to technological developments since its proposal. Examples include the bidirectional focal attention network (BFAN) [15] and the position focused attention network (PFAN) [16]. In BFAN, irrelevant image regions and words cause deterioration in the correspondence between images and text; thus, they are removed.…”

Section: B. Methods For Local Image-Text Matching
confidence: 99%
“…Most text-based image retrieval approaches are based on deep neural networks [38,16,18,10,34,5]. The main objective of the retrieval system is to accurately measure the similarity between the inputs from two different modalities.…”

Section: Text-Based Image Retrieval
confidence: 99%
“…Cross-Modal Projection Learning (CMPL) [38] is proposed to pull image and text embeddings into an aligned space. To further enhance the retrieval performance in a fine-grained way, [16,18,10,34] proposed different attention-based approaches, applying visual attention between every image region and word.…”

Section: Text-Based Image Retrieval
confidence: 99%
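The CMPL snippet above describes pulling image and text embeddings into an aligned space and scoring them there. A hedged, minimal sketch of that shared-embedding retrieval score follows; the projection matrices here are hypothetical stand-ins for learned parameters, not CMPL's actual model.

```python
import numpy as np

def joint_embedding_score(img_feat, txt_feat, W_img, W_txt, eps=1e-8):
    """Project each modality into a common space (hypothetical learned
    projections W_img, W_txt) and compare with cosine similarity."""
    v = W_img @ img_feat          # image embedding in the shared space
    t = W_txt @ txt_feat          # text embedding in the shared space
    v = v / (np.linalg.norm(v) + eps)
    t = t / (np.linalg.norm(t) + eps)
    return float(v @ t)           # cosine similarity in [-1, 1]
```

The attention-based approaches cited afterwards refine this global score by additionally matching every image region against every word, rather than comparing a single pooled embedding per modality.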