2022
DOI: 10.48550/arxiv.2204.03972
Preprint

FashionCLIP: Connecting Language and Images for Product Representations

Abstract: The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release…
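The abstract describes FashionCLIP as a CLIP-style dual encoder whose shared image-text space supports zero-shot classification and retrieval. Below is a minimal sketch of that usage pattern with the Hugging Face transformers CLIP API; the generic openai/clip-vit-base-patch32 checkpoint, the product.jpg file and the candidate label prompts are illustrative assumptions, and a fashion fine-tuned checkpoint would be loaded the same way.

```python
# Minimal zero-shot classification sketch with a CLIP-style dual encoder.
# Checkpoint, image path and candidate labels are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch32"  # a fashion fine-tuned checkpoint would be swapped in here
model = CLIPModel.from_pretrained(ckpt).eval()
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.open("product.jpg")  # hypothetical product photo
labels = [
    "a photo of a red dress",
    "a photo of leather boots",
    "a photo of a denim jacket",
]

# Encode image and texts jointly; logits_per_image holds image-text similarities.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image

# Softmax over candidate descriptions gives zero-shot class probabilities.
probs = logits.softmax(dim=-1).squeeze(0)
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```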

Cited by 3 publications (4 citation statements)
References 17 publications
“…Differently, Mirchandani et al. [54] introduced a novel fashion-specific pre-training framework based on weakly supervised triplets, while in [53], two different pre-training tasks were proposed, one based on multi-view contrastive learning and the other on pseudo-attribute classification. Another recent approach exploits the power of the CLIP model [41]; it is fine-tuned on more specific vision-and-language data for the fashion domain [55].…”
Section: Related Work
confidence: 99%
“…Since our fashion data is also abundant, most early works pre-train on the fashion domain directly. However, a number of recent works [2,3,10,16,52] suggest that a generic-domain pre-trained CLIP [60] generalizes even better on the fashion tasks. In this work, we also exploit a pre-trained CLIP model.…”
Section: Related Work
confidence: 99%
“…Conversely, our approach operates in a zero-shot fashion by using both CLIP retrieval and CLIP representations to generate suggestions on-the-fly. Finally, our work builds on top of the recent wave of contrastive-based methods for representational learning: while latent product representations have been extensively studied from multiple angles (Xu et al., 2020), CLIP-like models are still very new in this domain: GradREC leverages the space learned by FashionCLIP, a fashion fine-tuning of the original CLIP (Chia et al., 2022).…”
Section: Related Work
confidence: 99%
“…a multi-modal model comprising an image and a text encoder. We refer to Chia et al. (2022) for details on training and retrieval / classification capabilities: since FashionCLIP has independent value in the industry, GradREC does not require any specific pre-training.…”
Section: Dataset and Pre-trained Space
confidence: 99%
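Since the quoted passage notes that GradREC simply reuses FashionCLIP's image and text encoders without any additional pre-training, a hedged sketch of how such a dual encoder serves retrieval may help: catalog images are embedded once offline, and a free-text query is ranked against them by cosine similarity. The checkpoint, the catalog file names and the query string below are illustrative assumptions, not details from either paper.

```python
# Sketch of text-to-image retrieval over pre-computed catalog embeddings.
# Checkpoint, file names and query are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch32"  # stand-in for a fashion fine-tuned model
model = CLIPModel.from_pretrained(ckpt).eval()
processor = CLIPProcessor.from_pretrained(ckpt)

catalog_paths = ["skirt.jpg", "sneakers.jpg", "wool_coat.jpg"]  # hypothetical catalog
images = [Image.open(p) for p in catalog_paths]

with torch.no_grad():
    # Embed and L2-normalise the catalog once, offline.
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Embed the free-text query at request time.
    txt_inputs = processor(text=["a warm winter coat"], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and every product; highest score wins.
scores = (txt_emb @ img_emb.T).squeeze(0)
for path, score in sorted(zip(catalog_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```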