In this work we propose a new computational framework, based on generative deep models, for synthesizing photo-realistic meal images from textual descriptions of their ingredients. Previous work on text-to-image synthesis typically relies on pre-trained text models to extract text features, followed by a generative neural network that generates realistic images conditioned on those features. These works mainly focus on spatially compact and well-defined object categories, such as birds or flowers. In contrast, meal images are significantly more complex, consisting of multiple ingredients whose appearance and spatial qualities are further modified by cooking methods.
We propose a method that first builds an attention-based ingredients-image association model, which is then used to condition a generative neural network tasked with synthesizing meal images. In addition, a cycle-consistency constraint is added to further improve image quality and control appearance. Extensive experiments show that our model is able to generate meal images corresponding to the ingredients, which could be used to augment existing datasets for solving other computational food analysis problems.
Fangda Han, Ricardo Guerrero, Vladimir Pavlovic.
To generate meal images from ingredients, we propose a two-step solution.
- Train a recipe association model to find a shared latent space between ingredient sets and images.
- Use the latent representation of ingredients to train a GAN to synthesize meal images conditioned on those ingredients.
Step 1. Attention-based Cross-Modal Association Model
The association model takes two modalities, an ingredient set (ingr1, ingr2, ingr3, …) and an image, as inputs. It minimizes their distance in the shared FoodSpace if they come from the same recipe and maximizes it otherwise.
This yields an effective ingredients encoder and image encoder. Both encoders are fully trained in this step and then applied in the second step.
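The pull-together, push-apart objective described above can be sketched as a margin-based triplet loss over cosine distances. This is only an illustration under stated assumptions, not the authors' implementation: the triplet form, the margin value of 0.3, the toy embeddings, and the names `cosine_distance` and `triplet_loss` are all hypothetical. The text only states that matching pairs are brought closer and non-matching pairs are pushed apart.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge loss that reaches zero once the positive is closer to the
    anchor than the negative by at least `margin` (assumed value)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy FoodSpace embeddings (hypothetical values, 2-D for readability).
ingredients = [1.0, 0.0]      # encoded ingredient set
matching_image = [0.9, 0.1]   # encoded image from the same recipe
other_image = [0.0, 1.0]      # encoded image from a different recipe

# Well-aligned case: the matching image is already much closer.
loss_good = triplet_loss(ingredients, matching_image, other_image)
# Misaligned case: roles swapped, so the loss is large.
loss_bad = triplet_loss(ingredients, other_image, matching_image)
```

In this toy case the loss vanishes when the matching image is already the nearer one and grows when the roles are swapped, which is the behavior the association model is trained toward.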
Step 2. Generative Meal Image Network
The generative adversarial network takes the ingredients as input and generates the corresponding meal image. We build upon StackGAN-v2 [2], which contains three branches stacked together.
A correctly-generated meal image should “contain” the ingredients it is conditioned on. Thus, a cycle-consistency term is introduced to keep the fake image contextually similar, in terms of ingredients, to the corresponding real image in FoodSpace.
Specifically, for a real image and the corresponding generated fake image, the cycle-consistency regularization minimizes the cosine distance between them in FoodSpace at different scales. This loss term is also shown at the bottom right of the previous generative meal image network figure.
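As a rough sketch of the regularizer described above, the snippet below averages the cosine distance between real and fake image embeddings, one pair per scale. The assumptions are substantial: the embeddings are taken as given (in the method they would come from the image encoder trained in step 1), averaging rather than summing over scales is a guess, and the function names and toy values are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def cycle_consistency_loss(real_embeddings, fake_embeddings):
    """Mean cosine distance over scales.

    Each argument is a list with one FoodSpace embedding per scale.
    Assumption: scales are weighted equally and averaged.
    """
    if len(real_embeddings) != len(fake_embeddings):
        raise ValueError("need one real and one fake embedding per scale")
    distances = [cosine_distance(r, f)
                 for r, f in zip(real_embeddings, fake_embeddings)]
    return sum(distances) / len(distances)

# Toy embeddings for three scales (hypothetical values).
real = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
fake_aligned = [[2.0, 0.0], [0.5, 0.0], [1.0, 0.0]]     # same direction as real
fake_orthogonal = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]  # unrelated content

loss_aligned = cycle_consistency_loss(real, fake_aligned)
loss_orthogonal = cycle_consistency_loss(real, fake_orthogonal)
```

Because cosine distance ignores vector length, the aligned case gives a loss of zero even though the fake embeddings differ in magnitude; only the direction in FoodSpace is penalized.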
The dataset we use is based on Recipe1M [3], which contains more than one million recipes. We only use recipes with at least one image (around 400k recipes) and split them into training, validation, and test sets.
The original training set contains about 16k ingredient names; however, some share similar meanings and could potentially be merged. We first select about 4k ingredient names by frequency. We then train a word2vec model on all available text, including recipe instructions, titles, and ingredients, and use it to cluster the 4k ingredient names in the vector space. The proposed clusters are then confirmed by human annotators, yielding a canonical ingredient list of about 2k entries. This semi-automatic process is illustrated in the figure below.
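The automatic part of this process (frequency selection followed by cluster proposals) might look like the sketch below. It makes several assumptions: the word vectors are supplied as a plain dictionary rather than produced by a trained word2vec model, the greedy threshold clustering is a stand-in because the text does not name the clustering algorithm, and the 0.9 threshold, function names, and toy data are hypothetical. The human confirmation step is not modeled.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_ingredients(recipes, k):
    """Return the k most frequent ingredient names across all recipes."""
    counts = Counter(name for recipe in recipes for name in recipe)
    return [name for name, _ in counts.most_common(k)]

def propose_clusters(names, vectors, threshold=0.9):
    """Greedy stand-in clustering: attach each name to the first cluster
    whose first member is at least `threshold` similar, else start a new one."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if cosine_similarity(vectors[name], vectors[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Toy recipes and toy word vectors (hypothetical values).
recipes = [
    ["tomato", "olive oil"],
    ["tomatoes", "olive oil", "basil"],
    ["tomato", "basil"],
    ["tomatoes", "garlic"],
    ["saffron"],
]
vectors = {
    "tomato": [1.0, 0.0, 0.0],
    "tomatoes": [0.98, 0.05, 0.0],
    "olive oil": [0.0, 1.0, 0.0],
    "basil": [0.0, 0.0, 1.0],
}

selected = top_ingredients(recipes, 4)          # rare names are dropped
clusters = propose_clusters(selected, vectors)  # candidates for human review
```

Here "tomato" and "tomatoes" end up in the same proposed cluster, which an annotator would then confirm or reject before the names are merged into one canonical ingredient.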
Evaluation of Association Model
The table below compares the baseline with our model using the canonical ingredient list. Not only do we achieve better results on the same task (e.g., 5K), but our performance on a larger retrieval range (e.g., 10K) is also better than the baseline's on 5K.
Below we show some samples retrieved by our model. As can be seen, although the correct image is not always ranked first (e.g., the green box), the model still retrieves quite relevant images for the query ingredients.
Meal Image Synthesis
For image synthesis, we compare with the baseline (StackGAN-v2), which was published in 2017. The results show that our model with the cycle-consistency regularization achieves better results on most subsets.
We also investigate the median rank (MedR) obtained by using synthesized images as queries to retrieve recipes with the association model trained in step 1. Our method outperforms StackGAN-v2 on most subsets, indicating the utility of both the ingredient cycle-consistency and the embedding model. Still, the generated images lag behind the real images in retrieval ability, underscoring the extreme difficulty of photo-realistic meal image synthesis.
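For readers unfamiliar with MedR, the sketch below computes it for a small retrieval problem. It assumes the common convention that query i's correct match is candidate i and that rank 1 is best. Cosine similarity as the scoring function, the tie rule (only strictly higher scores push the rank down), and all names and values are assumptions made for illustration.

```python
import math
import statistics

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def median_rank(query_embeddings, candidate_embeddings):
    """Median rank of the correct candidate over all queries.

    Assumption: candidate i is the ground-truth match for query i.
    The rank is 1 plus the number of candidates scoring strictly higher.
    """
    ranks = []
    for i, query in enumerate(query_embeddings):
        scores = [cosine_similarity(query, c) for c in candidate_embeddings]
        ranks.append(1 + sum(1 for s in scores if s > scores[i]))
    return statistics.median(ranks)

# Toy embeddings (hypothetical): image queries and recipe candidates.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[1.0, 0.1], [0.2, 1.0], [1.0, 0.0]]

medr = median_rank(queries, candidates)
```

A lower MedR is better: a value of 1 means that, for at least half of the queries, the correct recipe is ranked first.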
The figure below shows examples generated from different subsets. Within each category, the generated images capture the main ingredients for different recipes. Compared with StackGAN-v2, the images generated by our model usually show the ingredients more clearly and look more like the real images, which again shows the benefit of the cycle-consistency constraint.
Below we show images generated by interpolating in FoodSpace between two ingredient lists, one with and one without tomato (resp. blueberry). One can observe that the model gradually removes the target ingredient over the course of the interpolation.
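The interpolation itself can be sketched as follows, assuming it is linear between the two encoded ingredient lists (the text does not say which interpolation scheme is used). The function name, the number of steps, and the toy 2-D embeddings are hypothetical; in the full pipeline each intermediate vector would be passed to the generator to produce one image.

```python
def interpolate(start, end, steps):
    """Return `steps` vectors evenly spaced from `start` to `end`, inclusive.

    Assumption: plain linear interpolation in the embedding space.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # goes from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return path

# Toy FoodSpace embeddings (hypothetical): the first coordinate stands in
# for the "tomato" content, the second for everything else in the recipe.
with_tomato = [1.0, 0.5]
without_tomato = [0.0, 0.5]

path = interpolate(with_tomato, without_tomato, 5)
```

In this toy setup the first coordinate shrinks step by step while the second stays fixed, mirroring the gradual removal of the target ingredient seen in the generated images.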
In this paper, we develop a model for generating photo-realistic meal images based on sets of ingredients. We integrate the attention-based recipe association model with StackGAN-v2, aiming for the association model to yield ingredient features close to the real meal image in FoodSpace, with StackGAN-v2 attempting to reproduce this image class from the FoodSpace encoding. To improve the quality of generated images, we reuse the image encoder in the association model and design an ingredient cycle-consistency regularization term in the shared space. Finally, we demonstrate that processing the ingredients into a canonical vocabulary is a critical step in the synthesis process.
Experimental results demonstrate that our model is able to synthesize natural-looking meal images corresponding to desired ingredients, both visually and quantitatively, through retrieval metrics.
In the future, we aim to add further information, including recipe instructions and titles, to contextualize factors such as meal preparation, and to incorporate the amount of each ingredient in order to synthesize images with arbitrary ingredient quantities.
1. Chen, Jing-Jing et al.: Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval. 2018 ACM Multimedia Conference
2. Zhang, Han et al.: StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv preprint arXiv:1710.10916
3. Salvador, Amaia et al.: Learning Cross-modal Embeddings for Cooking Recipes and Food Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)