2022
DOI: 10.48550/arxiv.2204.03972
Preprint

FashionCLIP: Connecting Language and Images for Product Representations

Abstract: The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release…
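The abstract describes FashionCLIP as a CLIP-style dual encoder whose shared image-text space supports zero-shot classification and retrieval. Below is a minimal sketch of that usage pattern with the Hugging Face transformers CLIP API; the generic openai/clip-vit-base-patch32 checkpoint, the product.jpg file and the candidate label prompts are illustrative assumptions, and a fashion fine-tuned checkpoint would be loaded the same way.

```python
# Minimal zero-shot classification sketch with a CLIP-style dual encoder.
# Checkpoint, image path and candidate labels are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch32"  # a fashion fine-tuned checkpoint would be swapped in here
model = CLIPModel.from_pretrained(ckpt).eval()
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.open("product.jpg")  # hypothetical product photo
labels = [
    "a photo of a red dress",
    "a photo of leather boots",
    "a photo of a denim jacket",
]

# Encode image and texts jointly; logits_per_image holds image-text similarities.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image

# Softmax over candidate descriptions gives zero-shot class probabilities.
probs = logits.softmax(dim=-1).squeeze(0)
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```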

Cited by 3 publications (4 citation statements)
References 17 publications
“…Differently, Mirchandani et al. [54] introduced a novel fashion-specific pre-training framework based on weakly supervised triplets, while in [53], two different pre-training tasks were proposed, one based on multi-view contrastive learning and the other on pseudo-attribute classification. Another recent approach exploits the power of the CLIP model [41]; it is fine-tuned on more specific vision-and-language data for the fashion domain [55].…”
Section: Related Work
confidence: 99%
“…Since our fashion data is also abundant, most early works pre-train on the fashion domain directly. However, a number of recent works [2,3,10,16,52] suggest that a generic-domain pre-trained CLIP [60] generalizes even better on the fashion tasks. In this work, we also exploit a pre-trained CLIP model.…”
Section: Related Work
confidence: 99%
“…Conversely, our approach operates in a zero-shot fashion by using both CLIP retrieval and CLIP representations to generate suggestions on-the-fly. Finally, our work builds on top of the recent wave of contrastive-based methods for representational learning: while latent product representations have been extensively studied from multiple angles (Xu et al., 2020), CLIP-like models are still very new in this domain: GradREC leverages the space learned by FashionCLIP, a fashion fine-tuning of the original CLIP (Chia et al., 2022).…”
Section: Related Work
confidence: 99%
“…a multi-modal model comprising an image and a text encoder. We refer to Chia et al. (2022) for details on training and retrieval / classification capabilities: since FashionCLIP has independent value in the industry, GradREC does not require any specific pre-training.…”
Section: Dataset and Pre-trained Space
confidence: 99%
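Since the quoted passage notes that GradREC simply reuses FashionCLIP's image and text encoders without any additional pre-training, a hedged sketch of how such a dual encoder serves retrieval may help: catalog images are embedded once offline, and a free-text query is ranked against them by cosine similarity. The checkpoint, the catalog file names and the query string below are illustrative assumptions, not details from either paper.

```python
# Sketch of text-to-image retrieval over pre-computed catalog embeddings.
# Checkpoint, file names and query are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch32"  # stand-in for a fashion fine-tuned model
model = CLIPModel.from_pretrained(ckpt).eval()
processor = CLIPProcessor.from_pretrained(ckpt)

catalog_paths = ["skirt.jpg", "sneakers.jpg", "wool_coat.jpg"]  # hypothetical catalog
images = [Image.open(p) for p in catalog_paths]

with torch.no_grad():
    # Embed and L2-normalise the catalog once, offline.
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Embed the free-text query at request time.
    txt_inputs = processor(text=["a warm winter coat"], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and every product; highest score wins.
scores = (txt_emb @ img_emb.T).squeeze(0)
for path, score in sorted(zip(catalog_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```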