2023
DOI: 10.1007/978-3-031-25072-9_42
See Finer, See More: Implicit Modality Alignment for Text-Based Person Retrieval

Cited by 32 publications (7 citation statements)
References 30 publications
“…Results table quoted in the excerpt (the four columns appear to be Rank-1 / Rank-5 / Rank-10 / mAP):

Method                           Rank-1   Rank-5   Rank-10   mAP
ViTAA                            55.97    75.84    83.52     -
SSAN (Ding et al. 2021)          61.37    80.15    86.73     -
LapsCore (Wu et al. 2021)        63.40    -        87.80     -
LGUR (Shao et al. 2022)          65.25    83.12    89.00     -
SAF (Li, Cao, and Zhang 2022)    64.13    82.62    88.40     58.61
IVT (Shu et al. 2023)            65.59    83.11    89.21     60.66
RaSa (Bai et al. 2023a)          76 [row truncated in source]

…Adapter) are even worse in performance. The three methods are skilled at few-shot image classification, since the large-scale data knowledge CLIP acquires in the pre-training phase shares the same data characteristics as the downstream classification task.…”
Section: Methods (mentioning)
confidence: 99%
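
The adapter-style CLIP tuning referred to in the excerpt above keeps the CLIP backbone frozen and trains only a small bottleneck module whose output is blended with the original feature. A minimal sketch under that assumption follows; the layer sizes, the 0.2 residual ratio, and the class name are illustrative and not taken from the methods being compared.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Small bottleneck MLP blended with a frozen backbone feature (assumed design)."""

    def __init__(self, dim=512, reduction=4, ratio=0.2):
        super().__init__()
        self.ratio = ratio
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip_feat):
        # Only the adapter is trained; the input feature comes from a frozen CLIP encoder.
        return self.ratio * self.net(clip_feat) + (1 - self.ratio) * clip_feat

# Usage on a batch of pre-extracted 512-d features standing in for frozen CLIP outputs.
feats = torch.randn(8, 512)
print(ResidualAdapter()(feats).shape)  # torch.Size([8, 512])
```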
“…The pre-training and fine-tuning paradigm has achieved great success, driving the development of computer vision (CV) (Dosovitskiy et al. 2020) and natural language processing (NLP) (Brown et al. 2020). Many efforts (Yan et al. 2022; Radford et al. 2021; Ma et al. 2022; Yao et al. 2021; Cao et al. 2022; Fang et al. 2021; Shu et al. 2022) have attempted to extend pre-trained models to the multimodal field, and vision-language pre-training (VLP) has accordingly attracted growing attention.…”
Section: Vision-language Pre-training Models (mentioning)
confidence: 99%
“…As a leading pre-training model, CLIP differs from traditional single-modality supervised pre-training in that it leverages natural text descriptions to supervise the learning. Given this advantage, many follow-up works (Luo et al. 2022; Fang et al. 2021; Ma et al. 2022; Zhao et al. 2022; Shu et al. 2022; Han et al. 2021; Yan et al. 2022) have begun to transfer the knowledge of CLIP to visual-textual retrieval tasks and have obtained new state-of-the-art (SOTA) results. As a specific application of image-text cross-modal retrieval, T-ReID can likewise benefit from CLIP.…”
Section: Vision-language Pre-training Models (mentioning)
confidence: 99%
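
The text-supervised learning that this excerpt attributes to CLIP reduces to a symmetric contrastive objective over paired image and text embeddings, and retrieval then amounts to ranking gallery features by cosine similarity to a query. A minimal sketch of that objective, using random tensors in place of the actual CLIP encoders, could look like:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity logits: row i should match column i (its paired caption).
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Supervise both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with stand-in features for a batch of 8 image-text pairs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt).item())
```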
“…The text description is tokenized and enclosed with [SOS] and [EOS] tokens to indicate the beginning and end of the sequence. Following recent methods (Shu et al. 2023; Wei et al. 2023), we randomly mask the word tokens of the input text t_i with a fixed probability (usually 15% or 30%) and replace them with the special token [MASK] during training. The masked text sequence is then fed into the transformer to obtain a sequence of contextual text embeddings {f^t_sos, f^t_1, ..., f^t_eos}, where the transformer uses masked self-attention to capture correlations among tokens.…”
Section: Image-text Dual Encoder (mentioning)
confidence: 99%
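
The masking step described in this excerpt can be sketched as below; only the [MASK] replacement of word tokens with a fixed probability (15% or 30%) comes from the text, while the token ids, the helper name, and the concrete special-token set are illustrative assumptions.

```python
import random

def mask_word_tokens(token_ids, mask_token_id, special_ids, mask_prob=0.15):
    """Replace word tokens with mask_token_id with probability mask_prob.

    Tokens whose ids are in special_ids (e.g. [SOS], [EOS], padding) are never masked.
    """
    masked = []
    for tid in token_ids:
        if tid not in special_ids and random.random() < mask_prob:
            masked.append(mask_token_id)
        else:
            masked.append(tid)
    return masked

# Example: ids 1 and 2 stand in for [SOS]/[EOS]; 99 is an assumed [MASK] id.
seq = [1, 17, 204, 56, 980, 2]
print(mask_word_tokens(seq, mask_token_id=99, special_ids={1, 2}, mask_prob=0.30))
```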