Following our previous work on generating meal images from ingredients, we extend it with more experiments in our new paper, which has been accepted to WACV 2020.
Paper
arXiv: https://arxiv.org/abs/2002.11493
- F. Han, R. Guerrero, and V. Pavlovic, “CookGAN: Meal Image Synthesis from Ingredients,” in Winter Conference on Applications of Computer Vision (WACV ’20), Aspen, Colorado, 2020.
[BibTeX]
@InProceedings{han20wacv,
  author    = {Fangda Han and Ricardo Guerrero and Vladimir Pavlovic},
  booktitle = {Winter Conference on Applications of Computer Vision ({WACV} '20)},
  title     = {{CookGAN}: Meal Image Synthesis from Ingredients},
  year      = {2020},
  address   = {Aspen, Colorado},
  month     = mar,
}
How Attention Works in the Association Model

To evaluate the effect of the attention mechanism used in the association model, we report the retrieval scores with and without attention. Interestingly, the model with attention does not achieve better performance.
This is somewhat counter-intuitive, since Fig. 4 shows that the model with attention tends to focus on visually important ingredients.

For example, in the top-left recipe, the model attends to green beans and chicken soup; in the top-right recipe, it attends to mushrooms and leeks. Note that the model does not simply attend to ingredients that appear most frequently in the dataset (e.g., olive_oil, water, butter) but learns to focus on the ingredients that are most visible in the recipe. We suspect the attention mechanism does not improve the retrieval scores because the RNN already learns the importance of each ingredient implicitly. Nevertheless, attention can serve as an unsupervised way to locate the important ingredients of a recipe.
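To make this concrete, the sketch below shows one common way to implement ingredient-level attention in PyTorch. It is a minimal illustration, not our released code: the class name, the single linear scoring layer, and the embedding size are all assumptions.

```python
import torch
import torch.nn as nn

class IngredientAttention(nn.Module):
    """Softmax attention over per-ingredient encodings (illustrative sketch)."""
    def __init__(self, dim=300):          # dim is an assumed embedding size
        super().__init__()
        self.score = nn.Linear(dim, 1)    # one scalar score per ingredient

    def forward(self, h):                     # h: (batch, n_ingredients, dim)
        logits = self.score(h)                # (batch, n_ingredients, 1)
        alpha = torch.softmax(logits, dim=1)  # attention weight per ingredient
        pooled = (alpha * h).sum(dim=1)       # (batch, dim) recipe embedding
        return pooled, alpha.squeeze(-1)      # weights are what Fig. 4 visualizes
```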
Canonical Ingredients in Meal Image Generation

Ours w/o CI uses the original ingredients together with the proposed cycle-consistency constraint, while ours w/ CI uses the canonical ingredients together with the same constraint. From Table 2 we observe that using canonical ingredients does not always lead to better scores for the generative model. We argue that image quality depends mostly on the design of the generative model, whereas the canonical ingredients mainly improve the conditioning on the text.
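The idea behind canonical ingredients is to collapse surface variants of the same ingredient into a single token before embedding. The toy mapping below is purely illustrative (the actual canonicalization in the paper is built differently); it only demonstrates the effect on the text vocabulary.

```python
# Toy example only: the paper's canonicalization is more involved.
CANONICAL = {
    "extra virgin olive oil": "olive_oil",
    "olive oil": "olive_oil",
    "fresh basil leaves": "basil",
    "chopped basil": "basil",
}

def canonicalize(ingredients):
    """Map raw ingredient strings to canonical tokens (fallback: snake_case)."""
    return [CANONICAL.get(s.lower(), s.lower().replace(" ", "_")) for s in ingredients]

print(canonicalize(["Extra virgin olive oil", "Chopped basil"]))
# ['olive_oil', 'basil']
```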
To evaluate the conditioning on the text, we report the median rank (MedR) obtained by using synthesized images as queries to retrieve recipes with the association model. Table 2 suggests that the cycle-consistency constraint outperforms the baseline StackGAN-v2 on most subsets, indicating the utility of ingredient cycle-consistency. We also observe that applying canonical ingredients always leads to a better MedR, which demonstrates the effectiveness of our canonical-ingredient-based text embedding model. Still, the generated images remain far from the real images in retrieval ability, affirming the extreme difficulty of photo-realistic meal image synthesis.
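For concreteness, MedR can be computed as in the NumPy sketch below. It assumes (our convention here, not necessarily the paper's evaluation code) that row i of the image and recipe embedding matrices form a true pair and that both are L2-normalized.

```python
import numpy as np

def median_rank(img_emb, rec_emb):
    """Median rank of the true recipe when each image queries the recipe pool."""
    sims = img_emb @ rec_emb.T                  # (N, N) cosine similarities
    order = np.argsort(-sims, axis=1)           # recipes sorted best-first per query
    true_idx = np.arange(len(sims))[:, None]
    ranks = np.argmax(order == true_idx, axis=1) + 1  # 1-indexed rank of true recipe
    return np.median(ranks)                     # lower is better
```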

Fig. 5 shows examples generated from different subsets. Within each category, the generated images capture the main ingredients of the different recipes. Compared with StackGAN-v2, the images generated with the cycle-consistency constraint usually show clearer ingredient appearance and look more photo-realistic.
Component Analysis
Our generative model has two inputs, an ingredient feature c and a random vector z. In this section we analyze the different roles played by these two components.
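As a rough sketch of how the two inputs enter the network, the toy generator below simply concatenates c and z and decodes the result into a coarse image. All sizes are illustrative assumptions; the actual model is a multi-stage StackGAN-v2-style generator, not this single stage.

```python
import torch
import torch.nn as nn

class ToyCondGenerator(nn.Module):
    """Toy conditional generator: ingredient feature c + noise z -> coarse image."""
    def __init__(self, c_dim=128, z_dim=100, ngf=64):  # assumed sizes
        super().__init__()
        self.ngf = ngf
        self.fc = nn.Linear(c_dim + z_dim, ngf * 4 * 4)
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(ngf, ngf // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(ngf // 2, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, c, z):
        x = self.fc(torch.cat([c, z], dim=1)).view(-1, self.ngf, 4, 4)
        return self.decode(x)                   # (batch, 3, 16, 16)
```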


Fig. 6 shows examples generated from different ingredients with the same random vector z in the salad subset. The generated images contain different ingredients for different recipes while sharing a similar viewpoint. This demonstrates the model's ability to synthesize meal images conditioned on the ingredient feature c while keeping nuisance factors fixed through the vector z.

Fig. 7 further demonstrates the different roles of the ingredient feature c and the random vector z by showing examples generated from the same ingredients with different random vectors. The synthesized images have different viewpoints, but all still appear to share the same ingredients.
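With the toy generator above, the two figures correspond to two ways of sampling; the random feature vectors below are hypothetical stand-ins for real encoder outputs.

```python
import torch

G = ToyCondGenerator()                           # toy generator sketched above
feats = [torch.randn(1, 128) for _ in range(4)]  # stand-ins for ingredient features c
z = torch.randn(1, 100)

fig6_like = [G(c, z) for c in feats]             # vary c, fix z: different meals, similar viewpoint
c0 = feats[0]
fig7_like = [G(c0, torch.randn(1, 100)) for _ in range(4)]  # fix c, vary z: same meal, new viewpoints
```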

To demonstrate the ability to synthesize meal images containing a specific key ingredient, in Fig. 8 we choose a target ingredient and show images synthesized from linear interpolations (in the feature space) between a pair of ingredient lists $r_i$ and $r_j$, where $r_i$ contains the target ingredient and $r_j$ does not, but shares at least 70% of the remaining ingredients with $r_i$. (We allow this partial overlap because very few recipes differ in exactly one key ingredient.) One can observe that the model gradually removes the target ingredient during this interpolation-based removal process.
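The interpolation itself is a simple convex combination in the feature space. In the sketch below, c_i and c_j stand for the (hypothetical) encoder outputs of $r_i$ and $r_j$; rendering each step with a fixed z gives the removal sequence in Fig. 8.

```python
import torch

def interpolate(c_i, c_j, steps=8):
    """Linearly interpolate between two ingredient features (sketch)."""
    return [(1 - t) * c_i + t * c_j for t in torch.linspace(0, 1, steps)]

# Hypothetical usage with the toy generator above and a fixed z:
# frames = [G(c_t, z) for c_t in interpolate(c_i, c_j)]
```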
Conclusion
We hope the experiments in this post shed further light on our meal image generation framework. Please read our paper, accepted at WACV 2020, for a more comprehensive study.