CookGAN: Meal Image Synthesis from Ingredients

Following our previous work on generating meal images from ingredients, we extend it with additional experiments in a new paper accepted to WACV 2020.

Paper

Arxiv: https://arxiv.org/abs/2002.11493

  • F. Han, R. Guerrero, and V. Pavlovic, “CookGAN: Meal Image Synthesis from Ingredients,” in Winter Conference on Applications of Computer Vision (WACV ’20), Aspen, Colorado, 2020.
    [BibTeX]
    @InProceedings{han20wacv,
    author = {Fangda Han and Ricardo Guerrero and Vladimir Pavlovic},
    booktitle = {Winter Conference on Applications of Computer Vision ({WACV} ’20)},
    title = {{CookGAN}: Meal Image Synthesis from Ingredients},
    year = {2020},
    address = {Aspen, Colorado},
    month = mar,
    date-added = {2019-09-09 15:01:10 -0400},
    date-modified = {2019-09-09 15:02:39 -0400},
    }

How Attention works in Association Model

To evaluate the effect of the attention mechanism used in the association model, we report the retrieval scores with and without attention. Interestingly, the model with attention does not achieve better performance.

This is somewhat counter-intuitive, since Fig. 4 shows that the model with attention tends to focus on visually important ingredients.

Figure 4: Attention on ingredients

For example, in the top-left recipe the model attends to green beans and chicken soup; in the top-right recipe it attends to mushrooms and leeks. Note that the model does not simply attend to ingredients that appear frequently in the dataset (e.g., olive_oil, water, butter) but learns to focus on the ingredients that are most visible in the recipe. We suspect the attention mechanism does not improve the retrieval scores because the RNN already learns the importance of each ingredient implicitly. Nevertheless, attention can serve as an unsupervised way to locate the important ingredients of a recipe.
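To make the idea concrete, here is a minimal sketch of attention pooling over ingredient embeddings: each ingredient is scored against a context vector, the scores are normalized with a softmax, and the embeddings are pooled by those weights. The context vector, dimensions, and embeddings below are hypothetical stand-ins, not the paper's actual parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(ingredient_embs, context):
    """Score each ingredient embedding against a (learned) context
    vector, then pool the embeddings with the attention weights."""
    scores = ingredient_embs @ context      # one score per ingredient
    weights = softmax(scores)               # attention distribution
    pooled = weights @ ingredient_embs      # weighted recipe representation
    return pooled, weights

rng = np.random.default_rng(0)
embs = rng.normal(size=(5, 8))   # 5 ingredients, 8-dim embeddings (toy sizes)
ctx = rng.normal(size=8)         # hypothetical learned context vector
recipe_vec, w = attend(embs, ctx)
```

In the full model the pooled vector would feed the recipe-side encoder, and inspecting `w` is what produces visualizations like Fig. 4.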

Canonical Ingredients in Meal Image Generation

Ours w/o CI uses the original ingredients with the proposed cycle-consistency constraint, while ours w/ CI uses the canonical ingredients with the same constraint. From Table 2 we observe that canonical ingredients do not always lead to better scores for the generative model. We argue that image quality depends mainly on the design of the generative model, while the canonical ingredients mainly improve the conditioning on the text.

To evaluate the conditioning on the text, we measure the median rank (MedR) obtained when synthesized images are used as queries to retrieve recipes with the association model. Table 2 suggests that the cycle-consistency constraint outperforms the baseline StackGAN-v2 on most subsets, indicating the utility of the ingredient cycle-consistency. We also observe that the canonical ingredients always lead to a better MedR, which demonstrates the effectiveness of our canonical-ingredient-based text embedding model. Still, the generated images remain far from the real images in retrieval ability, affirming the extreme difficulty of photo-realistic meal image synthesis.
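For readers unfamiliar with the metric, MedR can be computed roughly as follows: rank all recipe embeddings by similarity to each query image embedding and take the median rank of the true match. The pairing convention (row i matches row i) and cosine similarity are assumptions for this sketch.

```python
import numpy as np

def median_rank(img_embs, recipe_embs):
    """Median rank (1-based) of the matching recipe for each query
    image embedding, assuming row i of img_embs matches row i of
    recipe_embs and using cosine similarity."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    rec = recipe_embs / np.linalg.norm(recipe_embs, axis=1, keepdims=True)
    sims = img @ rec.T                      # pairwise cosine similarities
    ranks = []
    for i, row in enumerate(sims):
        order = np.argsort(-row)            # most similar first
        ranks.append(int(np.where(order == i)[0][0]) + 1)
    return float(np.median(ranks))
```

A perfect retriever yields MedR = 1; lower is better.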

Figure 5: Example results from StackGAN-v2 and our model conditioned on target ingredients; real images are shown for reference. CI denotes the proposed Canonical Ingredients.

Fig. 5 shows examples generated from different subsets. Within each category, the generated images capture the main ingredients of the different recipes. Compared with StackGAN-v2, the images generated with the cycle-consistency constraint usually show clearer ingredient appearance and look more photo-realistic.

Component Analysis

Our generative model has two inputs: an ingredient feature c and a random vector z. In this section we analyze the different roles played by these two components.
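The role split can be illustrated with a toy generator that maps the concatenated (content, style) code to an output vector; holding c fixed while sampling different z values mimics the experiments below. The linear map and all dimensions are made up for illustration; the real model is a StackGAN-v2-style convolutional generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper's actual sizes differ.
DIM_C, DIM_Z, DIM_IMG = 16, 8, 32
W = rng.normal(size=(DIM_IMG, DIM_C + DIM_Z))   # stand-in for the generator

def generate(c, z):
    """Toy generator: maps the concatenation of the ingredient
    (content) feature c and the noise (style) vector z to an
    'image' vector."""
    return W @ np.concatenate([c, z])

c = rng.normal(size=DIM_C)                       # fixed content
z1, z2 = rng.normal(size=DIM_Z), rng.normal(size=DIM_Z)
img1, img2 = generate(c, z1), generate(c, z2)    # same content, two styles
```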

Two inputs of the generator: an ingredient feature c as the content and a random vector z as the style
Figure 6: Example results from different ingredients c with same random vector z in the salad subset.

Fig. 6 shows examples generated from different ingredients with the same random vector z in the salad subset. The generated images contain different ingredients for different recipes while sharing a similar viewpoint. This demonstrates the model's ability to synthesize meal images conditioned on the ingredient feature c while keeping nuisance factors fixed through z.

Figure 7: Example results from the same ingredients with different random vectors. Eight synthesized images are shown for each real image (top-left).

Fig. 7 further demonstrates the different roles of the ingredient feature c and the random vector z by showing examples generated from the same ingredients with different random vectors. The synthesized images have different viewpoints but still all appear to share the same ingredients.

Figure 8: Example results of synthesized images from the linear interpolations in FoodSpace between two recipes (with and without target ingredient). Target ingredient on the left is tomato and the model is trained with salad subset; target ingredient on the right is blueberry and the model is trained with muffin subset.

To demonstrate the ability to synthesize meal images containing a specific key ingredient, in Fig. 8 we choose a target ingredient and show synthesized images from linear interpolations between a pair of ingredient lists $r_i$ and $r_j$ in the feature space, where $r_i$ contains the target ingredient and $r_j$ does not but shares at least 70% of the remaining ingredients with $r_i$. (We use partial overlap because very few recipes differ in exactly one key ingredient.) One can observe that the model gradually removes the target ingredient along the interpolation.
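The interpolation itself is straightforward; a minimal sketch (names and step count are illustrative) looks like this, where each interpolated code would be fed to the generator to render one frame of Fig. 8:

```python
import numpy as np

def interpolate(r_i, r_j, steps=5):
    """Linear interpolation between two recipe embeddings in the
    shared feature space (FoodSpace); each interpolant would be
    passed to the generator to visualize the gradual removal of
    the target ingredient."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * r_i + a * r_j for a in alphas]
```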

Conclusion

We hope the experiments in this post shed further light on our meal image generation framework. Please read our paper for a more comprehensive study.

