Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475465

Cross-modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Subspace Learning

Abstract: Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-ba…
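The abstract sketches the core idea: encode recipe text with a transformer and food images with a CNN, then project both into one shared embedding space where matched pairs sit close together. Below is a minimal illustrative sketch of that general setup in PyTorch; the dimensions, layer counts, the linear stand-in for an image backbone, and the symmetric contrastive loss are assumptions for illustration, not the paper's exact architecture or training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceModel(nn.Module):
    """Toy cross-modal model: recipe tokens and image features are mapped
    into a single joint embedding space (all sizes are placeholders)."""

    def __init__(self, vocab_size=30000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Transformer-based recipe encoder; the depth here is arbitrary.
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Stand-in for a CNN backbone (e.g. 2048-d ResNet-50 pooled features).
        self.image_proj = nn.Linear(2048, dim)

    def forward(self, token_ids, image_feats):
        t = self.text_encoder(self.embed(token_ids)).mean(dim=1)  # pool tokens
        v = self.image_proj(image_feats)
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(t, dim=-1), F.normalize(v, dim=-1)

def contrastive_loss(t, v, temperature=0.07):
    """Symmetric InfoNCE-style loss; matching recipe/image pairs lie on the
    diagonal of the batch similarity matrix."""
    logits = (t @ v.T) / temperature
    targets = torch.arange(len(t), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```

Training on batches of matched (recipe, image) pairs with a loss of this kind pulls paired embeddings together and pushes mismatched ones apart, which is what makes nearest-neighbor retrieval across modalities possible.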

Cited by 19 publications (22 citation statements). References 39 publications.
“…However, Ours+ResNet obtained much higher R@{5,10} than RDE-GAN. Some metrics at the 1k test set size (R@{5,10} in Image-to-Recipe and R@10 in Recipe-to-Image) are lower than X-MRS [9] and H-T [26]. Since the difference is less than 1.0, and Ours+ResNet has much higher R@1 (by more than 2.0) and better medR (2.4), we suppose the proposed method with a ResNet-50 image encoder backbone still outperforms the previous works.…”
Section: Cross-modal Recipe Retrieval (mentioning)
confidence: 96%
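The quoted comparison reasons over Recall@K (R@{1,5,10}) and median rank (medR) on a 1k test split. As a reference point, here is a small sketch of how these retrieval metrics are conventionally computed from a query-by-candidate similarity matrix; the function name and the 0-indexed ranking convention are ours, not taken from any of the cited papers.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K and median rank (medR) from an N x N similarity matrix
    where row i (query) matches column i (its ground-truth counterpart)."""
    n = sim.shape[0]
    # Sort candidates by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    # Rank of the true match within each row (0 = retrieved first).
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    recalls = {f"R@{k}": float(np.mean(ranks < k)) for k in ks}
    med_r = float(np.median(ranks + 1))  # medR is conventionally 1-indexed
    return recalls, med_r

# Example: sim = text_emb @ image_emb.T over a 1k test split, with rows as
# queries from one modality and columns as candidates from the other.
```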
“…MCEN [6] introduced cross-modal attention and consistency, and Zan et al. [33] introduced BERT [4] as a recipe encoder to enable cross-modal retrieval. X-MRS [9] introduced a Transformer [30] encoder to obtain recipe embeddings, further proposed the use of imperfect multilingual translations, and achieved state-of-the-art performance on retrieval tasks. Salvador et al. proposed a simple but effective framework, H-T [26], to harness the power of the Transformer.…”
Section: Related Work, 2.1 Cross-modal Recipe Retrieval (mentioning)
confidence: 99%
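Several of the works quoted above ([4], [33], [9]) build recipe embeddings with BERT-style or Transformer encoders. A minimal sketch of that idea using the Hugging Face transformers library follows; the checkpoint choice, the mean pooling over non-padding tokens, and the example recipe string are illustrative assumptions rather than any cited paper's exact pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical checkpoint choice; the cited papers train their own encoders.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

recipe = "Pasta carbonara. Ingredients: spaghetti, eggs, pancetta, pecorino. ..."
inputs = tokenizer(recipe, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state      # (1, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1)      # ignore padding positions
recipe_emb = (hidden * mask).sum(1) / mask.sum(1)  # mean-pooled recipe embedding
```

The resulting fixed-size vector can then be projected into the shared subspace and compared against image embeddings, as in the retrieval setup sketched earlier.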