Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475691
Collocation and Try-on Network

Cited by 16 publications (3 citation statements); references 38 publications.
“…cross domain retrieval) (Tangseng, Yamaguchi, and Okatani 2017;Li et al 2017;Han et al 2017;Hsiao and Grauman 2018;Tangseng, Yamaguchi, and Okatani 2017;Shih et al 2018;Li et al 2020), set complementary item retrieval (Hu, Yi, and Davis 2015;Huang et al 2015;Liu et al 2012), personalized set complementary item prediction (requires user input) (Taraviya et al 2021;Chen et al 2019;Li et al 2020;Su et al 2021;Zheng et al 2021;Guan et al 2022b,a) and multi-modal complementary item prediction (Guan et al 2021). All these prior work focus on feature representation learning.…”
Section: Related Work
confidence: 99%
“…Suppose there are 𝑃 global attributes. We then deploy 𝑃 learnable condition masks [40] on f 𝑟 to derive the global attribute features of the reference image as follows,…”
Section: Attribute Feature Extraction
confidence: 99%
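The condition-mask idea quoted above can be illustrated with a short sketch. Assuming f_r is a d-dimensional reference-image feature and there are P global attributes, each learnable mask gates f_r element-wise to produce one attribute feature; the dimensions, the sigmoid gating, and the random stand-ins for learned parameters are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8   # dimensionality of the reference feature f_r (illustrative)
P = 3   # number of global attributes (illustrative)

# f_r: reference-image feature; in the paper this comes from an image encoder.
f_r = rng.standard_normal(d)

# P learnable condition masks, one per global attribute.
# Here they are random stand-ins for trained parameters.
masks = rng.standard_normal((P, d))

# Squash each mask to (0, 1) so it acts as a soft element-wise gate,
# then derive one attribute feature per mask from the shared f_r.
gates = 1.0 / (1.0 + np.exp(-masks))   # shape (P, d)
attr_feats = gates * f_r               # shape (P, d), one row per attribute

print(attr_feats.shape)
```

Each row of `attr_feats` is one global attribute feature derived from the same reference representation, which matches the role the quoted passage assigns to the P condition masks.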
“…In addition, in the computer vision domain, Caramalau et al [2] presented a novel sequential GCN to learn node representations and distinguish sufficiently different unlabeled examples from labeled examples for active learning, and Zhang et al [53] devised a multimodal interaction GCN to jointly explore the complex intramodal relations and inter-modal interactions for temporal language localization in videos. Zheng et al [55] integrated disentangled item representations into a GCN to adaptively propagate the finegrained compatibility relationships among items for outfit compatibility modeling, Wang et al [46] developed a novel neural graph collaborative filtering method that integrated user-item interactions into a user embedding process for recommendation systems.…”
Section: Graph Convolutional Network
confidence: 99%