Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Zhuge, Mingchen; Gao, Dehong; Fan, Deng-Ping; Jin, Linbo; Chen, Ben; Zhou, Haoming; Qiu, Minghui; Shao, Ling

doi:10.1109/cvpr46437.2021.01246

Cited by 81 publications

(64 citation statements)

References 39 publications


“…the state-of-art commerce-domain pre-trained models [11,51]. We found even our smallest model already outperforms [11,51] with a clear margin, indicating CommerceMM's superior transferability.…”

Section: Transferability To Academic Dataset (mentioning)

confidence: 72%

“…One is using the ITM head to predict the matching score between the input image-text pair and rank the scores of all pairs [5,11,51].…”

Section: Downstream Tasks (mentioning)

confidence: 99%

“…We also evaluate how our pre-trained model performs on the academic dataset, e.g., FashionGen [31]. We strictly follow [51] constructing its image-text retrieval task. In its text-to-image retrieval, the model is required to pick the matched image from 101 images given a text.…”

Section: Transferability To Academic Dataset (mentioning)

confidence: 99%
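
The retrieval protocol quoted above (one matched image among 101 candidates, ranked by a matching score) can be sketched as follows. This is only an illustrative sketch, not the evaluation code of either paper: the scores are random stand-ins for ITM-head outputs, and the helper names `rank_of_match`, `recall_at_k`, and `make_query`, as well as the choice of a top-k hit rate as the metric, are assumptions made for the example.

```python
import random

def rank_of_match(scores, match_index):
    """1-based rank of the matched candidate under descending score order."""
    target = scores[match_index]
    # Count candidates scoring strictly higher; ties resolve in favour of the match.
    return 1 + sum(1 for s in scores if s > target)

def recall_at_k(ranks, k):
    """Fraction of queries whose matched candidate lands in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def make_query(rng, n_candidates=101):
    """Hypothetical stand-in for model scores of one text against 101 images."""
    scores = [rng.random() for _ in range(n_candidates)]
    match_index = rng.randrange(n_candidates)
    scores[match_index] += 0.5  # assumption: the model tends to favour the true pair
    return scores, match_index

rng = random.Random(0)
ranks = [rank_of_match(*make_query(rng)) for _ in range(200)]
for k in (1, 5, 10):
    print(f"top-{k} hit rate: {recall_at_k(ranks, k):.3f}")
```

In a real evaluation, `make_query` would be replaced by scoring each text against its 101 candidate images with the fine-tuned model.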

“…Recently vision-and-language representation learning is becoming a more and more popular research topic. This trend has also motivated people to study the commerce-specific pre-training [8,11,50,51]. In those works, the authors pre-train the transformer-based [37] model on commerce image-text pairs, then fine-tune it on image-text retrieval, image captioning, category recognition, etc.…”

Section: Introduction (mentioning)

confidence: 99%


CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

Yu, Sinha, Wang, et al. 2022

Preprint


We introduce CommerceMM, a multimodal model capable of providing a diverse and granular understanding of commerce topics associated to the given piece of content (image, text, image+text), and having the capability to generalize to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc. We follow the pre-training + fine-tuning training regime and present 5 effective pre-training tasks on image-text pairs. To embrace more common and diverse commerce data with text-to-multimodal, image-to-multimodal, and multimodal-to-multimodal mapping, we propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training. The pre-training is conducted in an efficient manner with only two forward/backward updates for the combined 14 tasks. Extensive experiments and analysis show the effectiveness of each task. When combining all pre-training tasks, our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning. Additionally, we propose a novel approach of modality randomization to dynamically adjust our model under different efficiency constraints.

CCS Concepts: Computing methodologies → Neural networks; Information systems → Multimedia and multimodal retrieval; Online shopping.


“…The background and other garment items in a given image are thus distractions and should be removed. To this end, a series of pre-processing steps are introduced: (1) We use a salient object detection model [41,57] to remove the background, which is an easy task given the typical clean background in fashion catalog images. (2) When there are multiple garments with the same category in one image (e.g., shoes and gloves), if they do not overlap, we only keep the one with the largest pixel area; (3) We delete the masks of garment parts (e.g., sleeves and pockets) but merge their attributes into the garments they belong to; (4) We delete the garments that have low-resolution or extreme aspect ratio; (5) If there are pixels of other garments in the bounding box, we mask these excess pixels with gray color.…”

Section: A Additional Information On UIGR Dataset (mentioning)

confidence: 99%
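
Steps (2) and (4) of the pre-processing quoted above can be sketched as below. This is a minimal illustration, not the UIGR authors' code: the `Garment` record, the bounding-box overlap test, and the thresholds `min_side` and `max_aspect` are all assumptions for the example, and step (2) is read here as applying only when no two same-category garments overlap.

```python
from dataclasses import dataclass

@dataclass
class Garment:
    category: str
    box: tuple       # (x0, y0, x1, y1) in pixels; hypothetical representation
    pixel_area: int  # number of mask pixels

def boxes_overlap(a, b):
    """True if two axis-aligned boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def passes_size_filter(g, min_side=64, max_aspect=4.0):
    """Step (4): drop low-resolution or extreme-aspect-ratio garments.
    Both thresholds are illustrative assumptions."""
    w, h = g.box[2] - g.box[0], g.box[3] - g.box[1]
    if min(w, h) <= 0 or min(w, h) < min_side:
        return False
    return max(w, h) / min(w, h) <= max_aspect

def keep_largest_per_category(garments):
    """Step (2): among non-overlapping garments of one category, keep the largest."""
    by_category = {}
    for g in garments:
        by_category.setdefault(g.category, []).append(g)
    kept = []
    for group in by_category.values():
        any_overlap = any(
            boxes_overlap(a.box, b.box)
            for i, a in enumerate(group)
            for b in group[i + 1:]
        )
        if len(group) > 1 and not any_overlap:
            kept.append(max(group, key=lambda g: g.pixel_area))
        else:
            kept.extend(group)  # single item, or overlapping items left untouched
    return kept

items = [
    Garment("shoe", (0, 0, 100, 80), 6000),
    Garment("shoe", (200, 0, 320, 90), 8000),
    Garment("dress", (50, 100, 250, 500), 60000),
    Garment("belt", (0, 600, 400, 620), 5000),  # 20 px tall: fails the size filter
]
cleaned = [g for g in keep_largest_per_category(items) if passes_size_filter(g)]
print([(g.category, g.pixel_area) for g in cleaned])
```

The remaining steps (background removal, merging part attributes, gray masking) operate on pixel masks and are left out of this sketch.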

UIGR: Unified Interactive Garment Retrieval

Han, He, Zhang, et al. 2022

Preprint


Interactive garment retrieval (IGR) aims to retrieve a target garment image based on a reference garment image along with user feedback on what to change on the reference garment. Two IGR tasks have been studied extensively: text-guided garment retrieval (TGR) and visually compatible garment retrieval (VCR). The user feedback for the former indicates what semantic attributes to change with the garment category preserved, while the category is the only thing to be changed explicitly for the latter, with an implicit requirement on style preservation. Despite the similarity between these two tasks and the practical need for an efficient system tackling both, they have never been unified and modeled jointly. In this paper, we propose a Unified Interactive Garment Retrieval (UIGR) framework to unify TGR and VCR. To this end, we first contribute a large-scale benchmark suited for both problems. We further propose a strong baseline architecture to integrate TGR and VCR in one model. Extensive experiments suggest that unifying two tasks in one framework is not only more efficient by requiring a single model only, it also leads to better performance. Code and datasets are available at GitHub.


Multimodal Retrieval in E-Commerce

Hendriksen, 2022

Lecture Notes in Computer Science
