Pre-training Graph Transformer with Multimodal Side Information for Recommendation

Liu, Yong; Yang, Susen; Lei, Chenyi; Wang, Guoxin; Tang, Haitao; Zhang, Jie; Sun, Aixin; Chen, Miao

doi:10.1145/3474085.3475709

Cited by 59 publications

(47 citation statements)

References 37 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…To verify the effectiveness of our proposed pre-training model, we compare it with the following representative baseline methods: Random, CLIP [17], MMGCN [29], GPT-GNN [8], Graph-BERT [31], PMGT [12], and GCN-P [14]. A short description of all baselines is given in Appendix B.…”

Section: Baseline Methodsmentioning

confidence: 99%

“…However, it ignores the masking operations on the nodes, which may limit the ability to aggregate the features of different nodes. To address this problem, PMGT [12] designs a masked node feature reconstruction task, which aims to reconstruct the features of masked nodes by other non-masked nodes so that it improves the recommendation performance. Unlike these existing methods, our proposed multi-modal contrastive pre-training method aims to integrate the multi-modalities information both on the user side and item side, capture modality-specific features and aggregate cross-modality information from both users and items.…”

Section: Related Workmentioning

confidence: 99%

“…However, it does not employ masking operations on the nodes. • PMGT [12] is a pre-trained multi-modal graph transformer model, which learns item representations by considering both item side information and their relationships. Different from Graph-BERT, it designs a masked node feature reconstruction task to obtain more refined embeddings.…”

Section: B Baselinesmentioning

confidence: 99%

“…For each user, we utilize the review texts to initialize its features. Following [12], we convert all the observed review ratings to be positive interactions and filter out the products that are not included in the metadata files. • Sharee Dataset: Sharee (now renamed Lemon8) is a stream of interest information under ByteDance for the Japanese market, which is a benchmark for the Japanese version of Xiaohongshu.…”

Section: A Datasetsmentioning

confidence: 99%

“…modal-specific graph and conducts graph convolutional operations to capture the modal-specific user preference and item representations. [12] proposes a pre-training multi-modal graph transformer method, which learns the item representations with graph structure reconstruction and masked node feature reconstruction. However, these two methods ignore the users' review information and do not better capture the potential correlation of users and items.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Multi-Modal Contrastive Pre-training for Recommendation

Liu

Schubert

et al. 2022

Proceedings of the 2022 International Conference on Multimedia Retrieval

View full text Add to dashboard Cite

Personalized recommendation plays a central role in various online applications. To provide quality recommendation service, it is of crucial importance to consider multi-modal information associated with users and items, e.g., review text, description text, and images. However, many existing approaches do not fully explore and fuse multiple modalities. To address this problem, we propose a multimodal contrastive pre-training model for recommendation. We first construct a homogeneous item graph and a user graph based on the relationship of co-interaction. For users, we propose intramodal aggregation and inter-modal aggregation to fuse review texts and the structural information of the user graph. For items, we consider three modalities: description text, images, and item graph. Moreover, the description text and image complement each other for the same item. One of them can be used as promising supervision for the other. Therefore, to capture this signal and better exploit the potential correlation of intra-modalities, we propose a self-supervised contrastive inter-modal alignment task to make the textual and visual modalities as similar as possible. Then, we apply inter-modal aggregation to obtain the multi-modal representation of items. Next, we employ a binary cross-entropy loss function to capture the potential correlation between users and items. Finally, we fine-tune the pre-trained multi-modal representations using an existing recommendation model. We have performed extensive experiments on three real-world datasets. Experimental results verify the rationality and effectiveness of the proposed method.

show abstract

Section: Baseline Methodsmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: B Baselinesmentioning

confidence: 99%

Section: A Datasetsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Multi-Modal Contrastive Pre-training for Recommendation

Liu

Schubert

et al. 2022

Proceedings of the 2022 International Conference on Multimedia Retrieval

View full text Add to dashboard Cite

show abstract

Self‐supervised graph learning for occasional group recommendation

Hao¹,

Yin²,

Zhang³

et al. 2022

Int J of Intelligent Sys

View full text Add to dashboard Cite

As an important branch in Recommender System, occasional group recommendation has received more and more attention. In this scenario, each occasional group (cold‐start group) has no or few historical interacted items. As each occasional group has extremely sparse interactions with items, traditional group recommendation methods can not learn high‐quality group representations. The recent proposed Graph Neural Networks (GNNs), which incorporate the high‐order neighbors of the target occasional group, can alleviate the above problem in some extent. However, these GNNs still can not explicitly strengthen the embedding quality of the high‐order neighbors with few interactions. Motivated by the self‐supervised learning technique, which is able to find the correlations within the data itself, we propose a self‐supervised graph learning framework, which takes the user/item/group embedding reconstruction as the pretext task to enhance the embeddings of the cold‐start users/items/groups. To explicitly enhance the high‐order cold‐start neighbors' embedding quality, we further introduce an embedding enhancer, which leverages the self‐attention mechanism to improve the embedding quality for them. Comprehensive experiments show the advantages of our proposed framework than the state‐of‐the‐art methods.

show abstract