Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning

Salvador, Amaia; Gundogdu, Erhan; Bazzani, Loris; Donoser, Michael

doi:10.1109/cvpr46437.2021.01522

Cited by 48 publications

(76 citation statements)

References 31 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In addition, the author proposed a self-supervised loss function computed based on pairs of the individual recipe components to leverage the semantic relationships within recipes. Whereas in [8] the author developed a neural with joint embedding learned on the recipes and images in common space. In the model a high-level classification task was added to further improve the classification performance.…”

Section: Related Workmentioning

confidence: 99%

“…We first start with the models that deliver the state-of-theart (SOTA) performance on accuracy. Specifically, we focus on two models that frame the problem as a cross-modal recipe retrieval task [6] [8]. The main difference between these two models lies in the design of recipe encoder: 1) one uses a two-stage LSTM [6] while the other uses hierarchical transformers [6].…”

Section: Approachmentioning

confidence: 99%

“…2) [6] introduces an auxiliary self-supervised task and loss to learn the semantic relationship between recipe components, strengthening the recipe encoder. For image encoder, [6] uses the visual transformer ViT [2] while [8] uses the ResNet-50 [3]. We train these two models from scratch to compare the performance and cost.…”

Section: Approachmentioning

confidence: 99%

“…We train these two models from scratch to compare the performance and cost. For evaluation, we follow the procedure described in [8], which uses cosine similarity in the common space for ranking the relevant recipes and perform im2recipe retrieval on validation dataset in Recipe 1M [11]. The performance metrics we report include median rank (MedR) and recall rate at top K (R@K, K = 1, 5, 10) for all the retrievals.…”

Section: Approachmentioning

confidence: 99%

“…Build recipe library. A pre-trained recipe encoder from [8] is used to compute recipe embeddings. Recipes from Recipe1M [6] dataset and corresponding recipe embeddings are collected as a recipe library.…”

Section: Introduction 1introduction and Backgroundmentioning

confidence: 99%

See 4 more Smart Citations

RecipeSnap -- a lightweight image-to-recipe model

Chen¹,

Yin²,

Xu³

2022

Preprint

View full text Add to dashboard Cite

In this paper we want to address the problem of automation for recognition of photographed cooking dishes and generating the corresponding food recipes. Current image-to-recipe models are computation expensive and require powerful GPUs for model training and implementation. High computational cost prevents those existing models from being deployed on portable devices, like smart phones. To solve this issue we introduce a lightweight image-to-recipe prediction model, RecipeSnap, that reduces memory cost and computational cost by more than 90% while still achieving 2.0 MedR, which is in line with the state-of-the-art model. A pre-trained recipe encoder from [8] was used to compute recipe embeddings. Recipes from Recipe1M [6] dataset and corresponding recipe embeddings are collected as a recipe library (Figure 1), which are used for image encoder training (Figure 2) and image query (Figure 3) later. We use MobileNet-V2 as image encoder backbone, which makes our model suitable to portable devices. This model can be further developed into an application for smart phones with a few effort. A comparison of the performance between this lightweight model to other heavy models are presented in this paper. Code, data and models are publicly accessible 1 .

show abstract

Section: Related Workmentioning

confidence: 99%