Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413961

Context-Aware Multi-View Summarization Network for Image-Text Matching

Abstract: Image-text matching is a vital yet challenging task in multimedia analysis. Over the past decades, great efforts have been made to bridge the semantic gap between the visual and textual modalities. Despite their significance and value, most prior works are still confronted with the multi-view description challenge, i.e., how to align an image with multiple textual descriptions that exhibit semantic diversity. Toward this end, we present a novel context-aware multi-view summarization network to summarize context-en…


Cited by 77 publications (28 citation statements) | References 34 publications (80 reference statements)
“…Among the region-phrase-based methods, Niu et al [21] proposed a cross-modal attention model to align features from the two modalities at the global-to-global, global-to-local, and local-to-local levels in order to extract multi-granular features. However, these works require cross-modal operations for each image-text pair, which introduces a high computational cost [24]. Recently, Wang et al [32] proposed an approach that is free from cross-modal operations.…”
Section: Related Work
confidence: 99%
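The statement above concerns region-phrase alignment via cross-modal attention and its per-pair cost. The sketch below is a generic, illustrative text-to-image stacked-attention matching score in PyTorch; the function name, temperature value, and feature shapes are assumptions rather than any cited paper's implementation. It also shows why such methods are expensive: the whole computation has to be repeated for every candidate image-text pair.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention_score(regions, words, temperature=9.0):
    """Illustrative text-to-image stacked attention: each word attends over
    image regions, and the matching score averages the word-context
    similarities. regions: (n_regions, d); words: (n_words, d)."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    affinity = words @ regions.t()                     # word-region affinities (n_w, n_r)
    attn = F.softmax(temperature * affinity, dim=-1)   # each word attends over the regions
    attended = attn @ regions                          # attended visual context per word
    word_scores = F.cosine_similarity(words, attended, dim=-1)
    return word_scores.mean()                          # scalar image-text matching score

# Toy usage: 36 region features and 12 word features, both 1024-d (illustrative shapes).
score = cross_modal_attention_score(torch.randn(36, 1024), torch.randn(12, 1024))
```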
“…One popular cross-modal alignment strategy involves adopting attention models to acquire correspondences between body parts and words [17,16,2]. However, this strategy depends on cross-modal operations for each image-text pair, which are computationally expensive [24]. Another intuitive strategy involves splitting one textual description into several groups of noun phrases by using external tools, e.g.…”
Section: Introduction
confidence: 99%
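The quote above is cut off at "e.g.", so the external tool it names is not visible here. One commonly used option for splitting a description into noun phrases is spaCy; the following is a minimal sketch under that assumption.

```python
# Minimal noun-phrase splitting sketch using spaCy as the external tool
# (the specific tool intended in the truncated quote is an assumption).
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A young woman in a red coat walks a small dog along the beach.")
noun_phrases = [chunk.text for chunk in doc.noun_chunks]
print(noun_phrases)  # e.g. ['A young woman', 'a red coat', 'a small dog', 'the beach']
```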
“…Different from visual representation, text representation varies little across methods. Most methods use the powerful pretrained language model BERT [5] to obtain text representations, and some methods [6,8,17,19,24,28] also use a GRU [2,31].…”
Section: Textual Representations
confidence: 99%
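As a rough illustration of the two textual encoders mentioned in the statement above, the sketch below extracts sentence features with a pretrained BERT (via the Hugging Face transformers library) and, alternatively, with a bidirectional GRU over word embeddings. The dimensions and the reuse of BERT's tokenizer for the GRU branch are arbitrary choices for the sketch, not those of any cited method.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

sentence = "Two dogs are playing in the snow."

# BERT-based text representation: contextual features for every token.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    bert_feats = bert(**inputs).last_hidden_state     # (1, n_tokens, 768)

# GRU-based alternative over (randomly initialized) word embeddings.
vocab_size, embed_dim, hidden_dim = tokenizer.vocab_size, 300, 1024
embedding = nn.Embedding(vocab_size, embed_dim)
gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
gru_feats, _ = gru(embedding(inputs["input_ids"]))    # (1, n_tokens, 2 * hidden_dim)
```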
“…Rather than training on single image-text pairs, CAMERA [28] jointly trains an image with its multiple (multi-view) descriptions, and selects content information through an attention module that exploits both intra-modal and inter-modal interactions. Although CAMERA also uses a contrastive loss similar to previous works, it introduces a diversity regularization term that distinguishes its loss.…”
Section: Pretrained Models
confidence: 99%
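The statement above contrasts CAMERA's objective with earlier contrastive losses through its diversity regularization term. The sketch below is a generic hinge-based bidirectional ranking loss plus an orthogonality-style diversity penalty on multi-view attention maps; it illustrates the idea only and is not CAMERA's exact formulation, and all names and hyperparameters are assumptions.

```python
import torch

def ranking_loss_with_diversity(img_emb, txt_emb, attn, margin=0.2, lam=0.1):
    """Hinge-based bidirectional ranking loss plus a diversity penalty that
    discourages the multi-view attention maps of each sample from collapsing
    onto the same content (||A A^T - I||_F^2 style). Illustrative only.
    img_emb, txt_emb: (batch, d) L2-normalized embeddings of matched pairs.
    attn: (batch, n_views, n_regions) multi-view attention weights."""
    sim = img_emb @ txt_emb.t()                          # (batch, batch) similarities
    pos = sim.diag().unsqueeze(1)                        # matched-pair similarities
    cost_i2t = (margin + sim - pos).clamp(min=0)         # image-to-text hinge costs
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)     # text-to-image hinge costs
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    rank = cost_i2t.masked_fill(mask, 0).sum() + cost_t2i.masked_fill(mask, 0).sum()
    # Diversity term: attention rows of each sample should be near-orthogonal.
    eye = torch.eye(attn.size(1))
    div = ((attn @ attn.transpose(1, 2)) - eye).pow(2).sum(dim=(1, 2)).mean()
    return rank + lam * div
```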
“…the given image (visual question answering [2,9,13]), generating multimodal responses based on user intentions (multimodal task-oriented dialog [25]), or describing what they see with a natural sentence (image captioning [1,6,42,43,45,46]). With the development of deep learning techniques, there has been a steady momentum of breakthroughs that push the limits of vision-language tasks [32,44]. Despite having promising quantitative results, the achievements rely heavily on the requirement of large quantities of task-specific annotations (e.g., image-question-answer triplets/image-sentence pairs) for such neural model learning.…”
Section: Introduction
confidence: 99%