Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475465

Cross-modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Subspace Learning

Abstract: Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-ba…
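The abstract sketches the core idea: encode recipe text with a transformer and food images with a CNN, then project both into one shared embedding space where matched pairs sit close together. Below is a minimal illustrative sketch of that general setup in PyTorch; the dimensions, layer counts, the linear stand-in for an image backbone, and the symmetric contrastive loss are assumptions for illustration, not the paper's exact architecture or training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceModel(nn.Module):
    """Toy cross-modal model: recipe tokens and image features are mapped
    into a single joint embedding space (all sizes are placeholders)."""

    def __init__(self, vocab_size=30000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Transformer-based recipe encoder; the depth here is arbitrary.
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Stand-in for a CNN backbone (e.g. 2048-d ResNet-50 pooled features).
        self.image_proj = nn.Linear(2048, dim)

    def forward(self, token_ids, image_feats):
        t = self.text_encoder(self.embed(token_ids)).mean(dim=1)  # pool tokens
        v = self.image_proj(image_feats)
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(t, dim=-1), F.normalize(v, dim=-1)

def contrastive_loss(t, v, temperature=0.07):
    """Symmetric InfoNCE-style loss; matching recipe/image pairs lie on the
    diagonal of the batch similarity matrix."""
    logits = (t @ v.T) / temperature
    targets = torch.arange(len(t), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```

Training on batches of matched (recipe, image) pairs with a loss of this kind pulls paired embeddings together and pushes mismatched ones apart, which is what makes nearest-neighbor retrieval across modalities possible.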

Cited by 19 publications (22 citation statements). References 39 publications.
“…However, Ours+ResNet obtained much higher R@{5,10} than RDE-GAN. Some metrics at the 1k test set size (R@{5,10} in Image-to-Recipe and R@10 in Recipe-to-Image) are lower than X-MRS [9] and H-T [26]. Since the difference is less than 1.0, and Ours+ResNet has much higher R@1 (by more than 2.0) and better medR (2.4), we suppose the proposed method with a ResNet-50 image encoder backbone still outperforms the previous works.…”
Section: Cross-modal Recipe Retrieval (mentioning)
confidence: 96%
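The quoted comparison reasons over Recall@K (R@{1,5,10}) and median rank (medR) on a 1k test split. As a reference point, here is a small sketch of how these retrieval metrics are conventionally computed from a query-by-candidate similarity matrix; the function name and the 0-indexed ranking convention are ours, not taken from any of the cited papers.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K and median rank (medR) from an N x N similarity matrix
    where row i (query) matches column i (its ground-truth counterpart)."""
    n = sim.shape[0]
    # Sort candidates by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    # Rank of the true match within each row (0 = retrieved first).
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    recalls = {f"R@{k}": float(np.mean(ranks < k)) for k in ks}
    med_r = float(np.median(ranks + 1))  # medR is conventionally 1-indexed
    return recalls, med_r

# Example: sim = text_emb @ image_emb.T over a 1k test split, with rows as
# queries from one modality and columns as candidates from the other.
```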
“…MCEN [6] introduced cross-modal attention and consistency, and Zan et al. [33] introduced BERT [4] as a recipe encoder to enable cross-modal retrieval. X-MRS [9] introduced a Transformer [30] encoder to obtain recipe embeddings, further proposed the use of imperfect multilingual translations, and achieved state-of-the-art performance on retrieval tasks. Salvador et al. proposed a simple but effective framework, H-T [26], to harness the power of the Transformer.…”
Section: Related Work, 2.1 Cross-modal Recipe Retrieval (mentioning)
confidence: 99%
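Several of the works quoted above ([4], [33], [9]) build recipe embeddings with BERT-style or Transformer encoders. A minimal sketch of that idea using the Hugging Face transformers library follows; the checkpoint choice, the mean pooling over non-padding tokens, and the example recipe string are illustrative assumptions rather than any cited paper's exact pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical checkpoint choice; the cited papers train their own encoders.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

recipe = "Pasta carbonara. Ingredients: spaghetti, eggs, pancetta, pecorino. ..."
inputs = tokenizer(recipe, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state      # (1, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1)      # ignore padding positions
recipe_emb = (hidden * mask).sum(1) / mask.sum(1)  # mean-pooled recipe embedding
```

The resulting fixed-size vector can then be projected into the shared subspace and compared against image embeddings, as in the retrieval setup sketched earlier.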