2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01157
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining

Cited by 39 publications (30 citation statements)
References 33 publications
“…Specifically, a novel Entity-Graph Enhanced Cross-Modal Pretraining (EGE-CMP) model is proposed for instance-level commodity retrieval, which explicitly injects entity knowledge in both node-based and subgraph-based ways into the multi-modal networks via a self-supervised hybrid-stream transformer. This reduces the confusion between different object contents, thereby effectively guiding the network to focus on entities with real semantics. Experimental results verify the efficacy and generalizability of our EGE-CMP, outperforming several SOTA cross-modal baselines such as CLIP [1], UNITER [2] and CAPTURE [3].…”
supporting
confidence: 58%
“…The results of experiments on both multi-product retrieval and identical-product retrieval tasks show the superiority of our EGE-CMP over SOTA cross-modal baselines such as ViLBERT [7], CLIP [1], UNITER [2], and CAPTURE [3] on all major criteria by a large margin. Moreover, extensive ablation experiments are conducted to demonstrate the generalizability of EGE-CMP and to investigate various essential factors of our proposed task.…”
Section: Introduction
mentioning
confidence: 95%