The Art of Food: Meal Image Synthesis from Ingredients

The task is to generate a meal image given a set of ingredients

Fangda Han


In this work we propose a new computational framework, based on generative deep models, for synthesizing photo-realistic meal images from textual descriptions of their ingredients. Previous work on text-to-image synthesis typically relies on a pre-trained text model to extract text features, followed by a generative neural network that produces realistic images conditioned on those features. These works mainly focus on generating spatially compact, well-defined categories of objects, such as birds or flowers. In contrast, meal images are significantly more complex, consisting of multiple ingredients whose appearance and spatial qualities are further modified by cooking methods.

We propose a method that first builds an attention-based ingredient-image association model, which is then used to condition a generative neural network tasked with synthesizing meal images. A cycle-consistency constraint is further added to improve image quality and control appearance. Extensive experiments show that our model is able to generate meal images corresponding to the input ingredients, which could be used to augment existing datasets for other computational food analysis problems.


Fangda Han, Ricardo Guerrero, Vladimir Pavlovic.

The Art of Food: Meal Image Synthesis from Ingredients.


To generate meal images from ingredients, we propose a two-step solution.

  • Train a recipe association model to find a shared latent space between ingredient sets and images.
  • Use the latent representation of the ingredients to train a GAN to synthesize meal images conditioned on those ingredients.

Step 1. Attention-based Cross-Modal Association Model

Attention-based Cross-Modal Association Model

The association model takes inputs from two modalities, an ingredient set (ingr1, ingr2, ingr3, …) and an image, and minimizes their distance in the shared FoodSpace when they come from the same recipe, while maximizing it otherwise.

Loss function of the association model

By doing so we obtain effective Ingredients and Image Encoders, which are fully trained here and then applied in the second step.
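As a rough illustration of the association objective, the sketch below computes a triplet-style margin loss over cosine similarities in FoodSpace; the function names and margin value are hypothetical and not the exact loss used in the paper.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two FoodSpace embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def association_loss(ingr_emb, img_emb, neg_img_emb, margin=0.3):
    # Pull the matching ingredient/image pair together in FoodSpace and
    # push a non-matching image away, up to a margin.
    pos = cosine_sim(ingr_emb, img_emb)
    neg = cosine_sim(ingr_emb, neg_img_emb)
    return max(0.0, margin - pos + neg)
```

A perfectly aligned pair with an orthogonal negative incurs zero loss, while a mismatched pair is penalized.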

Step 2. Generative Meal Image Network

Generative Meal Image Network

The generative adversarial network takes the ingredients as input and generates the corresponding meal image. We build upon StackGAN-v2 [2], which contains three branches stacked together.

Loss function of the generative model

Cycle-consistency Constraint

A correctly-generated meal image should “contain” the ingredients it is conditioned on. Thus, a cycle-consistency term is introduced to keep the fake image contextually similar, in terms of ingredients, to the corresponding real image in FoodSpace.

Specifically, for a real image and the corresponding generated fake image, the cycle-consistency regularization aims at minimizing the cosine distance at different scales. This loss term is also shown at the bottom-right in the previous generative meal image network figure.
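In code, this regularization could be sketched as below, assuming the image encoder from step 1 produces one embedding per scale for both the real and the generated image; the helper name is illustrative.

```python
import numpy as np

def cycle_consistency_loss(real_feats, fake_feats):
    # Sum of cosine distances between real- and fake-image FoodSpace
    # embeddings at several scales (one pair per generator branch).
    loss = 0.0
    for r, f in zip(real_feats, fake_feats):
        cos = np.dot(r, f) / (np.linalg.norm(r) * np.linalg.norm(f))
        loss += 1.0 - cos
    return loss
```

Identical embeddings at every scale give zero loss; the more the fake image's embedding drifts from the real one, the larger the penalty.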



The dataset we use is based on Recipe1M [3], which contains more than one million recipes. We only use recipes with at least one image (around 400k recipes) and split them into training, validation, and test sets.

The original training set contains about 16k ingredient names; however, many share similar meanings and can potentially be merged. We first select about 4k ingredient names by frequency, then train a word2vec model on all available text (recipe instructions, titles, and ingredients) and use it to cluster the 4k names in the embedding space. The proposed clusters are then verified by human annotators, yielding a canonical ingredient list of roughly 2k names. This semi-automatic process is illustrated in the figure below.
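The automatic part of this merging could look like the greedy sketch below, which groups ingredient names whose word2vec vectors exceed a cosine-similarity threshold; the threshold value and greedy scheme are illustrative assumptions, and the resulting clusters would still be checked by annotators.

```python
import numpy as np

def propose_merges(vectors, threshold=0.85):
    # Greedily group ingredient names whose embeddings are close in
    # cosine similarity; each name joins at most one cluster.
    names = list(vectors)
    clusters, assigned = [], set()
    for i, a in enumerate(names):
        if a in assigned:
            continue
        cluster = [a]
        assigned.add(a)
        for b in names[i + 1:]:
            if b in assigned:
                continue
            va, vb = vectors[a], vectors[b]
            sim = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
            if sim >= threshold:
                cluster.append(b)
                assigned.add(b)
        clusters.append(cluster)
    return clusters
```

For example, "tomato" and "tomatoes" end up in one candidate cluster while "flour" stays separate.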

Evaluation of the Association Model

The table below compares the baseline with our model using the canonical ingredient list. We not only achieve better results on the same task (e.g., the 5K retrieval range); our performance on a larger retrieval range (e.g., 10K) even exceeds theirs on 5K.

Below we show some retrieval samples from our model. Although the correct image is not always ranked first (e.g., the green box), the model still retrieves highly relevant images given the query ingredients.

Meal Image Synthesis

For image synthesis, we compare against the StackGAN-v2 baseline [2], published in 2017. The results show that our model with the cycle-consistency regularization achieves better results on most subsets.

We also investigate the median rank (MedR) when using synthesized images as queries to retrieve recipes with the association model trained in step 1. Our method outperforms StackGAN-v2 on most subsets, indicating the utility of both the ingredient cycle-consistency and the embedding model. Still, the generated images trail the real images in retrieval ability, affirming the extreme difficulty of photo-realistic meal image synthesis.
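This evaluation can be sketched concretely: each synthesized image's embedding ranks the recipe candidates by cosine similarity, and MedR is the median of the true recipe's rank over all queries (lower is better). The helper names below are illustrative, not from the paper's code.

```python
import numpy as np

def retrieval_rank(query, candidates, true_idx):
    # 1-based rank of the true recipe when candidates are sorted by
    # cosine similarity to the query embedding, highest first.
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    order = np.argsort(-(c @ q))
    return int(np.nonzero(order == true_idx)[0][0]) + 1

def median_rank(ranks):
    # Median rank (MedR) over a set of queries; lower is better.
    s = sorted(ranks)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
```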

The figure below shows examples generated from different subsets. Within each category, the generated images capture the main ingredients of different recipes. Compared with StackGAN-v2, the images generated by our model usually show clearer ingredient appearance and look more like the real images, which again demonstrates the benefit of the cycle-consistency constraint.

Below we show images generated by interpolating in FoodSpace between two ingredient lists, one with and one without tomato (resp. blueberry). One can observe that the model gradually removes the target ingredient during this interpolation-based removal process.
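The interpolation itself amounts to linear blending between the two FoodSpace embeddings before decoding each point with the generator; the function name below is an illustrative sketch.

```python
import numpy as np

def interpolate_embeddings(z_with, z_without, steps=5):
    # Linearly interpolate between two ingredient-list embeddings in
    # FoodSpace; feeding each point to the generator visualizes the
    # target ingredient gradually disappearing.
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * z_with + a * z_without for a in alphas]
```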


In this paper, we develop a model for generating photo-realistic meal images based on sets of ingredients. We integrate the attention-based recipe association model with StackGAN-v2, with the association model yielding an ingredient feature close to the real meal image in FoodSpace, and StackGAN-v2 attempting to reproduce the image from this FoodSpace encoding. To improve the quality of generated images, we reuse the image encoder from the association model and design an ingredient cycle-consistency regularization term in the shared space. Finally, we demonstrate that processing the ingredients into a canonical vocabulary is a critical step in the synthesis process.

Experimental results demonstrate that our model is able to synthesize natural-looking meal images corresponding to desired ingredients, both visually and quantitatively, through retrieval metrics.

In the future, we aim to incorporate additional information, including recipe instructions and titles, to further contextualize factors such as meal preparation, and to use the amount of each ingredient to synthesize images with arbitrary ingredient quantities.


1. Chen, Jing-Jing, et al.: Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval. 2018 ACM Multimedia Conference
2. Zhang, Han, et al.: StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv preprint arXiv:1710.10916
3. Salvador, Amaia, et al.: Learning Cross-modal Embeddings for Cooking Recipes and Food Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)


    Unsupervised Visual Domain Adaptation: A Deep Max-Margin Gaussian Process Approach: Oral Paper at CVPR 2019

    This is a joint work by Minyoung Kim, Pritish Sahu, Behnam Gholami, Vladimir Pavlovic.

    For more information please visit:


    [1] M. Kim, P. Sahu, B. Gholami, and V. Pavlovic, “Unsupervised Visual Domain Adaptation: A Deep Max-Margin Gaussian Process Approach,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.


    author = {Kim, Minyoung and Sahu, Pritish and Gholami, Behnam and Pavlovic, Vladimir},
    title = {Unsupervised Visual Domain Adaptation: A Deep Max-Margin Gaussian Process Approach},
    booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2019}

    cs535 special permission or prerequisite override requests for Fall 2018

    If you are interested in registering for my Fall 2018  cs535 Pattern Recognition and do not have the prerequisites or for some other reason need a Special Permission (SPN) or Prerequisite Override, you will need to fill out a request here:

    cs535 Fall 2018 SPN & Prereq Request Form

    Note that you will need a Google account in order to sign in and see the form.

    Please do not email me with individual requests.  I will be issuing SPNs and Prerequisite Overrides no earlier than 2 weeks before the start of the Fall 2018 semester.

    ICCV 2017 – Wrap Up

    Last week was the time for ICCV 2017 in Venice, Italy.  Setting aside my personal indifference to Venice (overcrowded and overpriced, uninspiring food),  ICCV was an interesting meeting.  It took place on Lido, at the same venue as the Venice Film Festival.  It was certainly a more appealing location to me personally, but a bit of a way from San Marco and the tourist Venice.

    The conference itself was, well, a mixed bag.  Plenty of deep block shuffling works.  About 3000 attendees is the number I heard.

    We presented our work “PUnDA: Probabilistic Unsupervised Domain Adaptation for Knowledge Transfer Across Visual Categories” [1]. It shows that end-to-end learning is not always necessary if the transfer model is properly constructed.  As luck would have it, I had to reshuffle the schedule, so the poster was up on both Wednesday and Thursday (thanks to Robert Walecki, who helped with the poster).  You can find a copy of the poster below.

    [1] B. Gholami, O. Rudovic, and V. Pavlovic, “PUnDA: Probabilistic Unsupervised Domain Adaptation,” in Proc. IEEE International Conference Computer Vision, 2017.
    author = {Gholami, Behnam and Rudovic, Ognjen and Pavlovic, Vladimir},
    booktitle = {Proc. IEEE International Conference Computer Vision},
    keywords = {domain_adaptation myown unsupervised},
    title = {PUnDA: Probabilistic Unsupervised Domain Adaptation},
    year = 2017

    cs536 – Machine Learning, Spring 2018 – Registration Requests

    If you are interested in registering for my Spring 2018 Machine Learning course (01:198:536) and do not have the prerequisites or for some other reason need a Special Permission (SPN) or Prerequisite Override, you will need to fill out a request here:

    cs536 Spring 2018 SPN & Prereq Request Form

    Note that you will need a Google account in order to sign in and see the form.

    Please do not email me with individual requests.  I will be issuing SPNs and Prerequisite Overrides no earlier than 2 weeks before the start of the Spring 2018 semester.

    “Unsupervised Domain Adaptation with Copula Models” presented at MLSP’17

    Our paper “Unsupervised Domain Adaptation with Copula Models”[1] was presented this week at the IEEE Int’l Workshop on Machine Learning for Signal Processing in Tokyo, Japan.

    It was an exciting meeting; unlike the now mega-conferences of CVPR and NIPS kind, MLSP is still refreshingly small.  There were several outstanding tutorials and keynotes by Kenji Fukumizu, Shun-ichi Amari, and Yee Whye Teh, with limited emphasis on “deep.”  Well organized!

    Then, there was Tokyo itself.  Always a pleasure to visit.


    [1] C. D. Tran, O. Rudovic, and V. Pavlovic, “Unsupervised domain adaptation with copula models,” in IEEE Int’l Conf. Machine Learning for Signal Processing (MLSP), 2017.
    author = {Cuong D. Tran and Ognjen Rudovic and Vladimir Pavlovic},
    title = {Unsupervised domain adaptation with copula models},
    booktitle = {IEEE Int’l Conf. Machine Learning for Signal Processing (MLSP)},
    year = {2017},
    note = {33\% contribution.},
    keywords = {domain adaptation, mlsp17},

    Chapter on “Machine Learning Methods for Social Signal Processing” in Social Signal Processing book

    Social Signal Processing book by Cambridge University Press was finally published this July.   Read our chapter on “Machine Learning Methods for Social Signal Processing”[1] on p. 234 of this collection of outstanding articles.


    [1] O. Rudovic, M. Nicolaou, and V. Pavlovic, “Machine Learning Methods for Social Signal Processing,” in Social Signal Processing, A. Vinciarelli, J. Burgoon, N. Magnenat-Thalmann, and M. Pantic, Eds., Cambridge University Press, 2017.
    Author = {Ognjen Rudovic and Mihalis Nicolaou and Vladimir Pavlovic},
    Chapter = {Machine Learning Methods for Social Signal Processing},
    Editor = {Alessandro Vinciarelli and Judee Burgoon and Nadia Magnenat-Thalmann and Maja Pantic},
    Note = {33\% contribution},
    Publisher = {Cambridge University Press},
    Title = {Social Signal Processing},
    Year = {2017}}

    Welcome to new lab members

    This fall we are welcoming two new students to our group:  Mihee Lee and Yuting Wang.   Mihee joins us from R&D at Samsung, where she worked after completing her BS in Math at Ewha Womans University, Korea.  Yuting completed her MS at the Karlsruhe Institute of Technology in Germany and was also a visiting student at CMU.

    Please join me in welcoming Mihee and  Yuting to Rutgers and Seqam Lab.