2021
DOI: 10.48550/arxiv.2103.11920
Preprint
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval

Abstract: Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image. While offering unmatched retrieval performance, such models 1) are typically pretrained from scratch and are thus less scalable, and 2) suffer from high retrieval latency and inefficiency, which makes them impractical in realistic applications. To address these crucial gaps towards both …

Cited by 3 publications (7 citation statements)
References 34 publications
“…There are several frameworks for image retrieval in the literature, starting from tag-based matching [56] to state-of-the-art vision-language transformers [58,63]. For the purpose of this paper, we used a Multi-Modal Transformer (MMT) [37] based text-image retrieval model. This model consists of two components: a fast (although somewhat lower quality) retrieval step that identifies a large set of relevant images, followed by a re-ranking step that selects the best images from the retrieved set.…”
Section: Image
confidence: 99%
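The two-stage design described in this citation statement, a fast but coarse retrieval step followed by a higher-quality reranking step, can be sketched as below. This is a toy illustration, not the MMT model itself: the bi-encoder is a hash-seeded random projection and the "cross-encoder" is simple token overlap, both stand-ins chosen only to make the two stages observable.

```python
import zlib
import numpy as np

def bi_encode(item: str, dim: int = 8) -> np.ndarray:
    # Stand-in for the fast twin-network (bi-encoder) embedding step.
    # crc32 gives a stable seed (unlike Python's salted hash()).
    rng = np.random.default_rng(zlib.crc32(item.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def cross_score(query: str, item: str) -> float:
    # Stand-in for the slower, higher-quality cross-encoder scorer:
    # token-overlap (Jaccard) ratio, just to make reranking observable.
    q, i = set(query.lower().split()), set(item.lower().split())
    return len(q & i) / max(len(q | i), 1)

def retrieve_then_rerank(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Stage 1: cheap dot-product retrieval over (precomputable) embeddings.
    q_vec = bi_encode(query)
    scored = sorted(corpus, key=lambda it: float(q_vec @ bi_encode(it)), reverse=True)
    candidates = scored[:k]
    # Stage 2: expensive pairwise scoring over the small candidate set only.
    return sorted(candidates, key=lambda it: cross_score(query, it), reverse=True)
```

The efficiency argument of the retrieve-and-rerank strategy lives entirely in stage 2 touching only `k` candidates rather than the whole corpus, which is why the expensive joint model becomes affordable at query time.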
“…For our experiments, we develop an image search engine that uses a state-of-the-art MultiModal Transformer (MMT) [37] retrieval model and a fair re-ranking algorithm (FMMR [51]) that aims to achieve demographic group fairness on the ranked list of images. Figure 1 shows a diagram of our attack approach; (a) shows example search results from an image search engine for the query "tennis player".…”
Section: Introduction
confidence: 99%
“…In many cases, the design is driven by the intended downstream tasks (e.g., VQA requires earlier fusion to enhance the joint representation, whereas cross-modal retrieval requires later fusion to speed up inference). There are also efforts to alleviate the gap between different architectures through retrieve-and-rerank strategies [56,19] or knowledge distillation [65,41]. Unlike them, inspired by recent advances in modality-agnostic models [1,71,64,63,35], we introduce a unified architecture that can easily be switched between single-stream and two-stream modes, so there is no need to modify the architecture for different downstream tasks.…”
Section: Related Work
confidence: 99%
“…DNN-based CMR methods. The success of DNN models extends to multimodal learning by supplying scalable nonlinear transformations for valid cross-modal representation learning [8,9,14,15,23,24,28,33,35,49,62,69]. While deep learning has already succeeded in representation learning for single modalities such as text, images, and audio, bridging the heterogeneous gap across modalities is still a big challenge for deep learning methods.…”
Section: Deep Neural Network (DNN)
confidence: 99%
“…DeViSE [14] addresses the problem that large numbers of object categories are hard to recognise without enough training data, by introducing a combined model encompassing visual and semantic embedding learning from image data and its unstructured text information. The cooperative and joint methods for CMR [15] propose a novel framework addressing the limited scalability and high retrieval cost of current CMR methods, by combining twin networks that encode all data for efficient initial retrieval with a cross-encoder structure applied to the retrieved candidates. The 𝐶 2 𝑀𝐿𝑅 [23] model discovers better representations and optimises a pairwise ranking function by enhancing both local and global alignment: local alignment matches visual objects with textual words, while global alignment matches high-level visual and high-level semantic representations.…”
Section: Deep Neural Network (DNN)
confidence: 99%
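The pairwise ranking objective mentioned in the statement above can be illustrated with a minimal hinge-style triplet loss. This is a generic sketch of the technique common to embedding-based CMR models, not the exact formulation of 𝐶 2 𝑀𝐿𝑅: the matched text should score higher than a mismatched text by at least a fixed margin.

```python
import numpy as np

def pairwise_rank_loss(img_emb: np.ndarray,
                       pos_txt: np.ndarray,
                       neg_txt: np.ndarray,
                       margin: float = 0.2) -> float:
    # Hinge-based pairwise ranking loss over cosine similarities:
    # zero when sim(img, positive) beats sim(img, negative) by >= margin,
    # positive (and worth minimising) otherwise.
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(img_emb, pos_txt) + cos(img_emb, neg_txt))
```

Summing this loss over sampled (image, matched text, mismatched text) triplets pushes matched pairs together and mismatched pairs apart in the shared embedding space, which is the mechanism behind both the local and the global alignment described above.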