2023
DOI: 10.1007/978-3-031-25072-9_42
See Finer, See More: Implicit Modality Alignment for Text-Based Person Retrieval

Cited by 32 publications (7 citation statements)
References 30 publications
“…Results table quoted in the excerpt (the four columns appear to be Rank-1 / Rank-5 / Rank-10 / mAP):

Method                           Rank-1   Rank-5   Rank-10   mAP
ViTAA                            55.97    75.84    83.52     -
SSAN (Ding et al. 2021)          61.37    80.15    86.73     -
LapsCore (Wu et al. 2021)        63.40    -        87.80     -
LGUR (Shao et al. 2022)          65.25    83.12    89.00     -
SAF (Li, Cao, and Zhang 2022)    64.13    82.62    88.40     58.61
IVT (Shu et al. 2023)            65.59    83.11    89.21     60.66
RaSa (Bai et al. 2023a)          76 [row truncated in source]

…Adapter) are even worse in performance. The three methods are skilled at few-shot image classification, since the large-scale data knowledge CLIP acquires in the pre-training phase shares the same data characteristics as the downstream classification task.…”
Section: Methods (mentioning)
confidence: 99%
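
The adapter-style CLIP tuning referred to in the excerpt above keeps the CLIP backbone frozen and trains only a small bottleneck module whose output is blended with the original feature. A minimal sketch under that assumption follows; the layer sizes, the 0.2 residual ratio, and the class name are illustrative and not taken from the methods being compared.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Small bottleneck MLP blended with a frozen backbone feature (assumed design)."""

    def __init__(self, dim=512, reduction=4, ratio=0.2):
        super().__init__()
        self.ratio = ratio
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip_feat):
        # Only the adapter is trained; the input feature comes from a frozen CLIP encoder.
        return self.ratio * self.net(clip_feat) + (1 - self.ratio) * clip_feat

# Usage on a batch of pre-extracted 512-d features standing in for frozen CLIP outputs.
feats = torch.randn(8, 512)
print(ResidualAdapter()(feats).shape)  # torch.Size([8, 512])
```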
“…The pre-training and fine-tuning paradigm has achieved great success, driving the development of computer vision (CV) (Dosovitskiy et al. 2020) and natural language processing (NLP) (Brown et al. 2020). Many efforts (Yan et al. 2022; Radford et al. 2021; Ma et al. 2022; Yao et al. 2021; Cao et al. 2022; Fang et al. 2021; Shu et al. 2022) have attempted to extend pre-trained models to the multimodal field, and vision-language pre-training (VLP) has accordingly attracted growing attention.…”
Section: Vision-language Pre-training Models (mentioning)
confidence: 99%
“…As a leading pre-training model, CLIP differs from traditional single-modality supervised pre-training in that it leverages natural text descriptions to supervise the learning. Given this advantage, many follow-up works (Luo et al. 2022; Fang et al. 2021; Ma et al. 2022; Zhao et al. 2022; Shu et al. 2022; Han et al. 2021; Yan et al. 2022) have begun to transfer the knowledge of CLIP to visual-textual retrieval tasks and have obtained new state-of-the-art (SOTA) results. As a specific application of image-text cross-modal retrieval, T-ReID can likewise benefit from CLIP.…”
Section: Vision-language Pre-training Models (mentioning)
confidence: 99%
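
The text-supervised learning that this excerpt attributes to CLIP reduces to a symmetric contrastive objective over paired image and text embeddings, and retrieval then amounts to ranking gallery features by cosine similarity to a query. A minimal sketch of that objective, using random tensors in place of the actual CLIP encoders, could look like:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity logits: row i should match column i (its paired caption).
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Supervise both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with stand-in features for a batch of 8 image-text pairs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt).item())
```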
“…The text description is tokenized and enclosed with [SOS] and [EOS] tokens to indicate the beginning and end of the sequence. Following recent methods (Shu et al. 2023; Wei et al. 2023), we randomly mask the word tokens of the input text t_i with a fixed probability (usually 15% or 30%) and replace them with the special token [MASK] during training. The masked text sequence is then fed into the transformer to obtain a sequence of contextual text embeddings {f^t_sos, f^t_1, ..., f^t_eos}, where the transformer uses masked self-attention to capture correlations among tokens.…”
Section: Image-text Dual Encoder (mentioning)
confidence: 99%
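
The masking step described in this excerpt can be sketched as below; only the [MASK] replacement of word tokens with a fixed probability (15% or 30%) comes from the text, while the token ids, the helper name, and the concrete special-token set are illustrative assumptions.

```python
import random

def mask_word_tokens(token_ids, mask_token_id, special_ids, mask_prob=0.15):
    """Replace word tokens with mask_token_id with probability mask_prob.

    Tokens whose ids are in special_ids (e.g. [SOS], [EOS], padding) are never masked.
    """
    masked = []
    for tid in token_ids:
        if tid not in special_ids and random.random() < mask_prob:
            masked.append(mask_token_id)
        else:
            masked.append(tid)
    return masked

# Example: ids 1 and 2 stand in for [SOS]/[EOS]; 99 is an assumed [MASK] id.
seq = [1, 17, 204, 56, 980, 2]
print(mask_word_tokens(seq, mask_token_id=99, special_ids={1, 2}, mask_prob=0.30))
```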