The Art of Food: Meal Image Synthesis from Ingredients

The task is to generate a meal image given a set of ingredients

Fangda Han


In this work we propose a new computational framework, based on generative deep models, for synthesizing photo-realistic meal images from textual descriptions of their ingredients. Previous works on text-to-image synthesis typically rely on pre-trained text models to extract text features, followed by a generative neural network that produces realistic images conditioned on those features. These works mainly focus on spatially compact and well-defined object categories, such as birds or flowers. In contrast, meal images are significantly more complex: they consist of multiple ingredients whose appearance and spatial qualities are further modified by cooking methods.

We propose a method that first builds an attention-based ingredients-image association model, which is then used to condition a generative neural network tasked with synthesizing meal images. A cycle-consistency constraint is added to further improve image quality and control appearance. Extensive experiments show that our model is able to generate meal images corresponding to the ingredients, which could be used to augment existing datasets for solving other computational food analysis problems.


Fangda Han, Ricardo Guerrero, Vladimir Pavlovic.

The Art of Food: Meal Image Synthesis from Ingredients.


To generate meal images from ingredients, we propose a two-step solution.

  • Train a recipe association model to find a shared latent space between ingredient sets and images
  • Use the latent representation of ingredients to train a GAN to synthesize meal images conditioned on those ingredients.

Step 1. Attention-based Cross-Modal Association Model


The association model takes two modalities as inputs: an ingredient set (ingr1, ingr2, ingr3, …) and an image. It minimizes their distance in the shared FoodSpace if they come from the same recipe, and maximizes their distance otherwise.

Loss function of the association model

By doing so we obtain an effective ingredients encoder and image encoder. These encoders are fully trained and then applied in the second step.
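As a rough illustration of the "pull matching pairs together, push non-matching pairs apart" objective, the sketch below uses a margin-based triplet loss on cosine distance. The triplet form, the margin value of 0.3 and the toy 2-D embeddings are assumptions made for illustration; the exact loss used in the model is the one shown in the loss figure.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two 1-D vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on (matching distance - non-matching distance + margin).

    The margin value is a hypothetical choice, not taken from the paper.
    """
    return max(0.0, cosine_distance(anchor, positive)
                    - cosine_distance(anchor, negative) + margin)

# Toy FoodSpace embeddings (hypothetical values)
ingr = np.array([1.0, 0.0])        # ingredient-set embedding
img_same = np.array([1.0, 0.0])    # image from the same recipe
img_other = np.array([0.0, 1.0])   # image from a different recipe

loss_good = triplet_loss(ingr, img_same, img_other)   # pairs already well separated
loss_bad = triplet_loss(ingr, img_other, img_same)    # pairs in the wrong order
```

A well-separated triplet contributes no loss, while a wrongly ordered one is penalized in proportion to how far the ordering is violated.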

Step 2. Generative Meal Image Network


The generative adversarial network takes the ingredients as input and generates the corresponding meal image. We build upon StackGAN-v2 [2], which contains three branches stacked together.

Loss function of the generative model

Cycle-consistency Constraint

A correctly-generated meal image should “contain” the ingredients it is conditioned on. Thus, a cycle-consistency term is introduced to keep the fake image contextually similar, in terms of ingredients, to the corresponding real image in FoodSpace.

Specifically, for a real image and the corresponding generated fake image, the cycle-consistency regularization aims at minimizing the cosine distance at different scales. This loss term is also shown at the bottom-right in the previous generative meal image network figure.
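A minimal sketch of this regularizer, assuming the real and generated images have already been passed through the image encoder at each scale and that the per-scale cosine distances are simply summed (the aggregation and the toy embeddings are assumptions):

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two 1-D vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cycle_consistency_loss(real_feats, fake_feats):
    """Sum of cosine distances between real and fake embeddings, one pair per scale."""
    return sum(cosine_distance(r, f) for r, f in zip(real_feats, fake_feats))

# Hypothetical FoodSpace embeddings of the real image at three scales
real = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]
# A fake image whose embeddings match the real ones at every scale
fake_identical = [r.copy() for r in real]
# A fake image whose coarsest-scale embedding is orthogonal to the real one
fake_off = [np.array([0.0, 1.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]

loss_identical = cycle_consistency_loss(real, fake_identical)
loss_off = cycle_consistency_loss(real, fake_off)
```

The loss vanishes when the generated image lands at the same place in FoodSpace as the real one and grows as the embeddings diverge at any scale.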



The dataset we use is based on Recipe1M [3], which contains more than one million recipes. We only use recipes with at least one image (around 400k recipes) and split them into training, validation and test sets.

The original training set contains about 16k ingredient names, but some share similar meanings and can be merged. We first select about 4k ingredient names by frequency. We then train a word2vec model on all available text, including recipe instructions, titles and ingredients, and use it to cluster the 4k ingredient names in the vector space. The proposed clusters are confirmed by human annotators, yielding a canonical ingredient list of about 2k entries. This semi-automatic process is illustrated in the figure below.

Evaluation of the Association Model

The table below compares the baseline with our model trained on the canonical ingredient list. We not only achieve better results on the same task (e.g. 5K); our performance on a larger retrieval range (e.g. 10K) is also better than the baseline's on 5K.

Below we show some samples retrieved by our model. Although the correct image (e.g. the one in the green box) is not always ranked first, the model still retrieves highly relevant images for the query ingredients.

Meal Image Synthesis

For image synthesis, we compare with the baseline, StackGAN-v2, which was published in 2017. The results show that our model with the cycle-consistency regularization achieves better results on most subsets.

We also investigate the median rank (MedR) by using synthesized images as the query to retrieve recipes with the association model trained in step 1. Our method outperforms StackGAN-v2 on most subsets, indicating both the utility of the ingredient cycle-consistency and the embedding model. Still, the generated images remain apart from the real images in their retrieval ability, affirming the extreme difficulty of the photo-realistic meal image synthesis task.
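For readers unfamiliar with the metric, the sketch below computes MedR for a set of queries against a gallery, assuming query `i` should retrieve gallery item `i` and that ranking uses cosine similarity. The function name and the toy embeddings are illustrative only.

```python
import numpy as np

def median_rank(query_emb, gallery_emb):
    """Median rank of the correct gallery item; query i matches gallery i, rank 1 is best."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T                          # cosine similarity matrix
    correct = np.diag(sim)[:, None]        # similarity to the true match
    ranks = 1 + np.sum(sim > correct, axis=1)
    return float(np.median(ranks))

gallery = np.eye(3)                        # three toy recipe embeddings
perfect_queries = np.eye(3)                # each query equals its match
noisy_queries = np.array([[1.0, 0.0, 0.0],   # correct item ranked 1st
                          [0.9, 0.1, 0.0],   # correct item ranked 2nd
                          [0.5, 0.6, 0.1]])  # correct item ranked 3rd

medr_perfect = median_rank(perfect_queries, gallery)
medr_noisy = median_rank(noisy_queries, gallery)
```

Lower is better: a MedR of 1 means that at least half of the queries retrieve their true match first.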

The figure below shows examples generated from different subsets. Within each category, the generated images capture the main ingredients for different recipes. Compared with StackGAN-v2, the images generated by our model usually show clearer ingredient appearance and look more like the real images, which again shows the benefit of the cycle-consistency constraint.

The figure below shows images generated by interpolating in FoodSpace between two ingredient lists, one with and one without tomato (resp. blueberry). One can observe that the model gradually removes the target ingredient along the interpolation path.
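The interpolation can be pictured as walking along a path between two FoodSpace embeddings and decoding each point with the generator. The sketch below assumes plain linear interpolation and uses toy 2-D vectors in place of real ingredient embeddings; both are assumptions for illustration.

```python
import numpy as np

def interpolate(z_with, z_without, steps):
    """Return `steps` embeddings moving linearly from z_with to z_without."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1 - a) * z_with + a * z_without for a in alphas])

z_with_tomato = np.array([1.0, 0.0])      # hypothetical embedding, tomato included
z_without_tomato = np.array([0.0, 1.0])   # hypothetical embedding, tomato removed
path = interpolate(z_with_tomato, z_without_tomato, steps=5)
```

Each row of `path` would then be fed to the generator to produce one frame of the interpolation sequence.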


In this paper, we develop a model for generating photo-realistic meal images based on sets of ingredients. We integrate the attention-based recipe association model with StackGAN-v2, aiming for the association model to yield ingredient features close to the real meal image in FoodSpace, with StackGAN-v2 attempting to reproduce this image class from the FoodSpace encoding. To improve the quality of generated images, we reuse the image encoder in the association model and design an ingredient cycle-consistency regularization term in the shared space. Finally, we demonstrate that processing the ingredients into a canonical vocabulary is a critical step in the synthesis process.

Experimental results demonstrate that our model is able to synthesize natural-looking meal images corresponding to desired ingredients, both visually and quantitatively, through retrieval metrics.

In the future, we aim to add further information, including recipe instructions and titles, to contextualize factors such as meal preparation, and to incorporate the amount of each ingredient in order to synthesize images with arbitrary ingredient quantities.


1. Chen, Jing-Jing et al.: Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval. 2018 ACM Multimedia Conference
2. Zhang, Han et al.: StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv preprint arXiv:1710.10916
3. Salvador, Amaia et al.: Learning cross-modal embeddings for cooking recipes and food images. Proceedings of the IEEE conference on computer vision and pattern recognition (2017)


    cs535 special permission or prerequisite override requests for Fall 2018

    If you are interested in registering for my Fall 2018  cs535 Pattern Recognition and do not have the prerequisites or for some other reason need a Special Permission (SPN) or Prerequisite Override, you will need to fill out a request here:

    cs535 Fall 2018 SPN & Prereq Request Form

    Note that you will need a Google account in order to sign in and see the form.

    Please do not email me with individual requests.  I will be issuing SPNs and Prerequisite Overrides no earlier than 2 weeks before the start of the Fall 2018 semester.

    Welcome to new lab members

    This fall we are welcoming two new students to our group: Mihee Lee and Yuting Wang. Mihee joins us from R&D at Samsung, where she worked after completing her BS in Math at Ewha Woman’s University, Korea. Yuting completed her MS at the Karlsruhe Institute of Technology in Germany and was also a visiting student at CMU.

    Please join me in welcoming Mihee and Yuting to Rutgers and Seqam Lab.

    Distributed Probabilistic Learning

    1. Abstract

    Traditional computer vision algorithms, particularly those that exploit various probabilistic and learning-based approaches, are often formulated in centralized settings. However, modern computational settings are becoming increasingly characterized by networks of peer-to-peer connected devices, with local data processing abilities. A number of distributed algorithms have been proposed to address the problems such as calibration, pose estimation, tracking, object and activity recognition in large camera networks [1],[2].

    One critical challenge in distributed data analysis includes dealing with missing data. In camera networks, different nodes will only have access to a partial set of data features because of varying camera views or object movement. For instance, object points used for SfM may be visible only in some cameras and only in particular object poses. As a consequence, different nodes will be frequently exposed to missing data. However, most current distributed data analysis methods are algebraic in nature and cannot seamlessly handle such missing data.

    In this work we present an approach to estimation and learning of generative probabilistic models in a distributed context where certain sensor data can be missing. In particular, we show how traditional centralized models, such as probabilistic PCA (PPCA) [3], missing-data PPCA [4] and Bayesian PCA (BPCA) [4], can be learned when the data is distributed across a network of sensors. We demonstrate the utility of this approach on the problem of distributed affine structure from motion. Our experiments suggest that the accuracy of the learned probabilistic structure and motion models rivals that of traditional centralized factorization methods while being able to handle challenging situations such as missing or noisy observations.

    2. Distributed Probabilistic Learning

    We propose a distributed consensus learning approach for parametric probabilistic models with latent variables that can effectively deal with missing data. The goal of the network of sensors is to learn a single consensus probabilistic model (e.g., 3D object structure) without ever resorting to a centralized data pooling and centralized computation. Let \( \mathbf{X} = \{ \mathbf{x}_{n} | \mathbf{x}_{n} \in \mathcal{R}^{D} \} \) be a set of iid multivariate data points with the corresponding latent variables  \( \mathbf{Z} = \{ \mathbf{z}_{n} | \mathbf{z}_{n} \in \mathcal{R}^{M} \} \), \(n = 1 … N\). Our model is a joint density defined on \( (\mathbf{x}_{n}, \mathbf{z}_{n}) \) with a global parameter \( \theta \),

    \[ (\mathbf{x}_{n}, \mathbf{z}_{n}) \sim p(\mathbf{x}_{n}, \mathbf{z}_{n} | \theta), \]

    with \( p(\mathbf{X}, \mathbf{Z} | \theta) = \prod_n p(\mathbf{x}_{n}, \mathbf{z}_{n} | \theta) \), as depicted in Figure 1-a. In this general model, we can find an optimal global parameter \( \hat{\theta} \) (in a MAP sense) by applying standard EM learning. It is important to point out that each posterior density estimate at point \( n \) depends solely on the corresponding measurement \( \mathbf{x}_{n} \) and does not depend on any other \( \mathbf{x}_{k}, k \neq n \), hence is decentralized. To consider the distributed counterpart of this model, let \( G = (V, E) \) be an undirected connected graph with vertices \( i, j \in V \) and edges \( e_{ij} = (i, j) \in E \) connecting the two vertices. Each \( i \)-th node is directly connected with 1-hop neighbors in \( \mathcal{B}_{i} = \{ j; e_{ij} \in E \} \). Suppose the set of data samples at \( i \)-th node is \( \mathbf{X}_{i} = \{ \mathbf{x}_{in}; n = 1, … , N_{i} \} \), where \( \mathbf{x}_{in} \in \mathcal{R}^{D} \) is \( n \)-th measurement vector and \( N_{i} \) is the number of samples collected in \( i \)-th node. Likewise, we define the latent variable set for node \( i \) as \( \mathbf{Z}_{i} = \{ \mathbf{z}_{in}; n = 1, … , N_{i} \} \).

    Learning the model parameter would be decentralized if each node had its own independent parameter \(\theta_i\). Still, the centralized model can be equivalently defined using the set of local parameters, with an additional constraint on their consensus, \( \theta_1 = \theta_2 = \cdots = \theta_{|V|} \). This is illustrated in Figure 1-b where the local node models are constrained using ties defined on the underlying graph. The simple consensus tying can be more conveniently defined using a set of auxiliary variables \( \rho_{ij} \), one for each edge \( e_{ij} \) (Figure 1-c). This now leads to the final distributed consensus learning formulation, similar to [5]:

    \begin{align*} \hat{\mathbf{\theta}} = \arg\min_{ \{ \theta_{i} : i \in V \} } & -\log p( \mathbf{X} | \mathbf{\theta}, G) \\ \text{s.t.} &\quad \theta_{i} = \rho_{ij}, \; \rho_{ij} = \theta_{j}, \; i \in V, \; j \in \mathcal{B}_{i}. \end{align*}

    This is a constrained optimization task that can be solved in a principled manner using the Alternating Direction Method of Multipliers (ADMM) [6]. ADMM iteratively, in a block-coordinate fashion, solves \( \max_{\lambda} \min_{\theta} \mathcal{L}(\cdot) \) on the augmented Lagrangian

    \begin{align*} \mathcal{L}( \mathbf{\theta}, \rho, \lambda ) &= -\log p( \mathbf{X} | \theta_{1}, \theta_{2}, \ldots , \theta_{|V|}, G) \\ &\quad + \sum_{i \in V} \sum_{j \in \mathcal{B}_{i}} \left\{ \lambda_{ij1}^{\text{T}} ( \theta_{i} - \rho_{ij} ) + \lambda_{ij2}^{\text{T}} ( \rho_{ij} - \theta_{j} ) \right\} \\ &\quad + \frac{ \eta }{ 2 } \sum_{i \in V} \sum_{j \in \mathcal{B}_{i}} \left\{ || \theta_{i} - \rho_{ij} ||^{2} + || \rho_{ij} - \theta_{j} ||^{2} \right\} \end{align*}

    where \( \lambda_{ij1}, \lambda_{ij2}, i,j \in V \) are the Lagrange multipliers, \( \eta \) is a positive scalar parameter and \( ||\cdot|| \) is the induced norm. The last term (modulated by \( \eta \)) is not strictly necessary for consensus but introduces additional regularization.
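To make the consensus formulation concrete, the sketch below applies it to a deliberately simple model: every node estimates the mean of unit-variance Gaussian data, so the local negative log-likelihood is a sum of squared errors. It uses the reduced per-node updates that are common in the consensus-ADMM literature once the edge variables \( \rho_{ij} \) are eliminated analytically. The ring graph, the value of \( \eta \), the iteration count and all function names are assumptions made for illustration, not the D-PPCA updates themselves.

```python
import numpy as np

def consensus_admm_mean(data, neighbors, eta=5.0, iters=500):
    """Each node i holds samples data[i] and estimates a shared mean.

    Local objective: 0.5 * sum_n ||x_in - theta_i||^2, with the edge
    variables rho_ij eliminated analytically.
    """
    n_nodes, dim = len(data), data[0].shape[1]
    theta = np.zeros((n_nodes, dim))
    lam = np.zeros((n_nodes, dim))      # one aggregated multiplier per node
    for _ in range(iters):
        new_theta = np.empty_like(theta)
        for i in range(n_nodes):
            nb = neighbors[i]
            # Pull towards the midpoints (theta_i + theta_j) / 2 of the previous iterate
            pull = sum(theta[i] + theta[j] for j in nb)
            new_theta[i] = (data[i].sum(axis=0) - 2 * lam[i] + eta * pull) / (
                len(data[i]) + 2 * eta * len(nb))
        theta = new_theta
        for i in range(n_nodes):
            # Dual ascent on the disagreement with 1-hop neighbors
            lam[i] += 0.5 * eta * sum(theta[i] - theta[j] for j in neighbors[i])
    return theta

rng = np.random.default_rng(0)
data = [rng.normal(loc=3.0, size=(10, 2)) for _ in range(4)]   # 4 nodes, 10 samples each
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}            # hypothetical ring topology
theta = consensus_admm_mean(data, ring)
global_mean = np.concatenate(data).mean(axis=0)
```

Each node only ever exchanges its current estimate with its 1-hop neighbors, yet all rows of `theta` approach the mean that a centralized estimator would compute from the pooled data.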


    3. Distributed Probabilistic Principal Component Analysis (D-PPCA)

    Distributed versions of PPCA and missing-data PPCA can be derived straightforwardly based on the model above. Detailed information, including derivation of iterative formula for distributed EM [5] can be found in Yoon and Pavlovic (2012).

    We tested D-PPCA on synthetic Gaussian data, including cases where some of the values are missing. As Figure 2 shows, D-PPCA performed comparably to its centralized counterpart regardless of whether data were missing, either missing-at-random (MAR) or missing-not-at-random (MNAR). We also report an empirical convergence analysis in the supplementary material.

    We also applied D-PPCA to the problem of distributed affine SfM. We conducted experiments in both synthetic (multiple cameras observing a rotating cube) and real settings. For the real settings, we used videos obtained from Caltech [7] and Johns Hopkins [8]. We simulated a multi-camera setting by dividing the frames sequentially among the cameras: with 5 cameras and 30 frames in total, frames 1 to 6 are assigned to camera 1, frames 7 to 12 to camera 2, and so on. We compared the structure reconstructed by D-PPCA with the centralized, SVD-based reconstruction using the subspace angle. Table 1 below shows the results obtained on the Caltech turntable dataset. It shows that the accuracy of D-PPCA rivals that of SVD-based methods even when some observed values are missing. For detailed and additional results and explanation, please refer to the attached manuscript and supplementary materials.
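The two mechanical steps in this protocol, splitting frames among simulated cameras and comparing reconstructions by subspace angle, can be sketched as follows. The sketch assumes the subspace angle is the largest principal angle between column spaces and that both matrices have full column rank; the helper name and toy matrices are illustrative.

```python
import numpy as np

def largest_subspace_angle(A, B):
    """Largest principal angle (radians) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)                 # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)                 # orthonormal basis of span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.arccos(np.clip(cosines.min(), -1.0, 1.0)))

# Sequential assignment of 30 frames to 5 simulated cameras
frames_per_camera = np.array_split(np.arange(1, 31), 5)

A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])       # span{e1, e2}
B = A @ np.array([[2.0, 1.0], [0.0, 3.0]])               # same span, different basis
C = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])       # span{e1, e3}

angle_same = largest_subspace_angle(A, B)
angle_diff = largest_subspace_angle(A, C)
```

An angle near zero indicates that two reconstructions span the same subspace even if their bases differ, which is why it is a convenient measure for affine structure that is only defined up to a linear transformation.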

    4. Distributed Bayesian Principal Component Analysis (D-BPCA)

    We can also apply a similar framework to obtain a distributed extension of the mean field variational inference (MFVI) formulation. It is easy to show that the distributed counterpart is equivalent to the centralized mean field variational inference optimization problem:

    \begin{align*} [\hat{\lambda}_Z, \hat{\lambda}_W] = &\underset{\lambda_{Z_i}, \lambda_{W_i}: i \in V}{\arg\min}\; - \mathbb{E}_Q\big[ \log P(X,Z,W|\Omega_z,\Omega_w) \big] + \mathbb{E}_Q[\log Q] \\ &\text{s.t.} \;\; \lambda_{W_i} = \rho_{ij},\;\;\rho_{ij} = \lambda_{W_j},\;\; i\in V, j\in \mathcal{B}_i \end{align*}

    where \( Z=\{z_i \in \mathbb{R}^{M}\}_{i=1}^{N} \) denotes a set of local latent variables, \( W \) denotes a global latent variable and \( \Omega=[\Omega_z, \Omega_w] \) denotes a set of fixed parameters. The forms of \( Q(z_n;\lambda_{z_n}) \) and \( Q(W;\lambda_{W}) \) are set to be in the same exponential family as the conditional distributions \( P(W|X,Z,\Omega_w) \). Using conjugate exponential families for the prior and likelihood distributions, each coordinate descent update in MFVI can be done in closed form. However, the penalty terms would be quadratic in the norm of the difference \( (\lambda_{W_i} - \rho_{ij}) \), which may result in non-analytic updates for \( \{\lambda_{W_i}\}_{i=1}^{|V|} \).

    To solve the above problem efficiently, we propose to use Bregman ADMM (B-ADMM) [9], which generalizes ADMM by replacing the quadratic penalty term with different Bregman divergences in order to exploit the structure of the problem. We propose to use the log partition function of the global parameter as the Bregman function. With this choice, we obtain analytical update formulas for B-ADMM, which have closed-form solutions. The figure below compares performance on the distributed affine structure from motion experiment described above, using the centralized versions of SVD, PPCA and BPCA (using variational inference) and the proposed distributed versions of PPCA and BPCA, under varying noise levels.

    Related Publications

    • S. Yoon and V. Pavlovic, “Decentralized Probabilistic Learning For Sensor Networks,” in IEEE Global Conference on Signal and Information Processing, 2016.

    • C. Song, S. Yoon, and V. Pavlovic, “Fast ADMM Algorithm for Distributed Optimization with Adaptive Penalty,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, 2016, p. 753–759.

    • B. Babagholami, S. Yoon, and V. Pavlovic, “D-MFVI: Distributed Mean Field Variational Inference using Bregman ADMM,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, 2016, p. 1582–158.

    • S. Yoon and V. Pavlovic, “Distributed Probabilistic Learning for Camera Networks with Missing Data,” in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., 2012, p. 2933–2941.


    1. R. J. Radke. “A Survey of Distributed Computer Vision Algorithms”. H. Nakashima, H. Aghajan and J. C. Augusto, eds. Springer Science+Business Media, LLC. 2010.
    2. R. Tron and R. Vidal. “Distributed Computer Vision Algorithms”, IEEE Signal Processing Magazine, Vol. 28. 2011, pp. 32-45.
    3. M. E. Tipping and C. M. Bishop. “Probabilistic Principal Component Analysis”, Journal of the Royal Statistical Society, Vol. Series B. 1999, pp. 611-622.
    4. A. Ilin and T. Raiko. “Practical Approaches to Principal Component Analysis in the Presence of Missing Values”, Journal of Machine Learning Research, Vol. 11. 2010, pp. 1957-2000.
    5. P. A. Forero, A. Cano and G. B. Giannakis. “Distributed clustering using wireless sensor networks”, IEEE Journal of Selected Topics in Signal Processing, Vol. 5, August, 2011, pp. 707-724.
    6. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, vol. 3, Now Publishers, 2011.
    7. P. Moreels and P. Perona. “Evaluation of Features Detectors and Descriptors based on 3D Objects”, International Journal of Computer Vision, Vol. 73. 2007, pp. 263-284.
    8. R. Tron and R. Vidal. “A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms”, IEEE International Conference on Computer Vision and Pattern Recognition. 2007.
    9. H. Wang and A. Banerjee, “Bregman Alternating Direction Method of Multipliers”, Advances in Neural Information Processing Systems 27, 2014.

    Recognition of Ancient Roman Coins using Spatial Information

    Recognition of Ancient Roman Coins

    1. Problem Formulation

    For a given Roman coin image, the goal is to recognize who is on the coin


    There are thousands of ways to categorize Roman coins. For example, we can classify the coins by attributes such as symbols, sizes, materials and legends. Note that these attributes are correlated: one attribute may help reveal the others. In this project, we focus on a face recognition problem in which, for a given Roman coin image, the goal is to recognize who is on the coin. For the images above, for example, we want to determine that the Roman emperor Caligula is engraved on the first coin; the second shows Maximus, and the others show the famous emperors Nero and Tiberius.


    2. Motivation

    Understanding the ancient Roman coins could serve as references to understand the Roman empire

    • The Roman coins are always connected to Roman historical events and Roman imperial propaganda
    • The Roman empire knew how to effectively use the coin as their political propaganda
    • The Roman coins were widely used to convey the achievements of Roman emperors to the public
    • The Roman coins served to spread messages of changing policies or merits throughout the empire
    • The Roman emperors also could show themselves to the entire empire by engraving portraits on the coins
    • The Roman coins were the newspaper of the Roman empire

    3. Practical Application

    A reliable and automatic method to recognize the coins is necessary

    • The coin market is very active, as many people collect coins as a hobby. The coins were also mass-produced and new Roman coins are excavated daily, which makes them affordable to collect.
    • Ancient coins are becoming subject to a very large illicit trade. Recognizing ancient Roman coins is not easy for novices; it requires expert knowledge.
    • The traditional approach is for authorities to search catalogues, dealers or the Internet periodically and manually.

    4. Challenges

    • Inter-class similarity due to engraver’s lack of knowledge for the emperor’s portrait and abstraction
    • Intra-class dissimilarity. The coins were made manually from different factories
    • The recognition of the face on the coin is different from that of the real face

    5. Coin Data Collection

    • Coin images are collected from a numismatic website [1, 2]
    • 2815 coin images with 15 Roman emperors
      – Small part of a much larger dataset
      – Annotated for visual analysis (the original dataset only has numismatic annotation)
      – Each emperor has at least 10 coin images
    • High-resolution images: 350-by-350 pixels

    6. Coin Recognition Methods using Spatial Information

    1. Deformable Part Model (DPM) based method
      – Encodes spatial information more precisely than a spatial pyramid by means of alignment
      – DPM is used to align the coin image by locating the face of the emperor
      – Training and testing of DPM

    2. Fisher Vector based method
      – Each point is represented as a combination of visual features and location, (x, l)
      – A Gaussian mixture model describes the probability of (x, l)
      \[ \begin{aligned} p(\mathbf{x}, \mathbf{l}) & = \sum_k \pi_k \cdot p(\mathbf{x}, \mathbf{l}; {\Sigma}_k^V, {\Sigma}_k^L, {\mu}_k^V, {\mu}_k^L) \\ & = \sum_k \pi_k \cdot p(\mathbf{x}; {\Sigma}_k^V, {\mu}_k^V) \cdot p(\mathbf{l}; {\Sigma}_k^L, {\mu}_k^L), \end{aligned} \] where \(\pi_k\) is the prior probability of the \(k\)th component, \({\Sigma}_k^V, {\mu}_k^V\) are the covariance and mean of the visual descriptors, \({\Sigma}_k^L, {\mu}_k^L\) are the covariance and mean of the location, and \[ \begin{aligned} p(\mathbf{x}; {\Sigma}_k^V, {\mu}_k^V) & = \mathcal{N} (\mathbf{x}; {\Sigma}_k^V, {\mu}_k^V)\\ p(\mathbf{l}; {\Sigma}_k^L, {\mu}_k^L) & = \mathcal{N} (\mathbf{l}; {\Sigma}_k^L, {\mu}_k^L). \end{aligned} \] The gradient of the log-likelihood with respect to the means and covariances defines the Fisher vector.
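The factorized mixture density above can be evaluated directly. The sketch below assumes diagonal covariances, a common simplification in Fisher vector pipelines that is not stated in the text, and uses toy dimensions and parameter values; all function names are illustrative.

```python
import numpy as np

def diag_gaussian(v, mean, var):
    """Density of a Gaussian with diagonal covariance `var` evaluated at v."""
    norm = np.prod(2.0 * np.pi * var) ** -0.5
    return norm * np.exp(-0.5 * np.sum((v - mean) ** 2 / var))

def spatial_gmm_density(x, l, weights, mu_v, var_v, mu_l, var_l):
    """p(x, l) = sum_k pi_k * N(x | visual params of k) * N(l | location params of k)."""
    return sum(
        w * diag_gaussian(x, mv, vv) * diag_gaussian(l, ml, vl)
        for w, mv, vv, ml, vl in zip(weights, mu_v, var_v, mu_l, var_l)
    )

# Single standard-normal component over a 2-D descriptor and a 2-D location
x = np.zeros(2)
l = np.zeros(2)
p_single = spatial_gmm_density(
    x, l, [1.0], [np.zeros(2)], [np.ones(2)], [np.zeros(2)], [np.ones(2)])

# Because the two factors are independent, the product equals one joint
# Gaussian over the concatenated vector (x, l)
p_joint = diag_gaussian(np.concatenate([x, l]), np.zeros(4), np.ones(4))
```

The factorization is what lets location be appended to the descriptor cheaply: each component keeps separate visual and spatial parameters instead of a full joint covariance.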

    7. Experimental Results

    • Experimental settings
      – 2815 coin images with 15 emperors
      – For evaluation, divide the coin dataset into 5 folds, training on 4 folds and testing on 1 fold
      – SIFT as visual feature
      – Multi-class SVM for training and prediction

    • Recognition accuracies for various methods
    • Confusion matrices
    • Discriminative regions
    • Outlier detection

    8. Conclusion

    We proposed two automatic methods to recognize ancient Roman coins. The first method employs the deformable part model to align the coin images and thereby improve the recognition accuracy. The second method exploits the spatial information of the coin by directly encoding the location information. As the first method takes the face location into account, it performs slightly better than the second method. The experiments show that both methods outperform other methods such as the standard spatial pyramid model and a human face recognition method.

    In this project, we collect a new ancient Roman coin dataset and investigate an automatic framework to recognize the coins, in which we employ a state-of-the-art face recognition system and exploit the spatial information of the coin to improve the recognition accuracy. The coin images are high-resolution (350-by-350 pixels) and the face locations are annotated. While the proposed coin recognition framework is based on standard methods such as bag-of-words with spatial pyramids, Fisher vectors and DPM, we believe that their use in the context of ancient coin recognition represents an interesting contribution.


    • [1] J. Kim and V. Pavlovic. “Ancient Coin Recognition Based on Spatial Coding”. Proc. International Conference on Pattern Recognition (ICPR). 2014.
    • [2] J. Kim and V. Pavlovic. “Improving Ancient Roman Coin Recognition with Alignment and Spatial Encoding”. ECCV Workshop VISART. 2014.

    Hybrid On-line 3D Face and Facial Actions Tracking in RGBD Video Sequences

    1. Abstract

    Tracking human faces has remained an active research area among the computer vision community for a long time due to its usefulness in a number of applications, such as video surveillance, expression analysis and human-computer interaction. An automatic vision-based tracking system is desirable and such a system should be capable of recovering the head pose and facial features, or facial actions. It is a non-trivial task because of the highly deformable nature of faces and their rich variability in appearances.

    A popular approach for face modeling and alignment is to use statistical models such as Active Shape Models and Active Appearance Models. These techniques have been refined over a long period of time and have proven to be robust. However, they were originally developed to work on 2D texture and require intensive preparation of training data. Another approach is to use a 3D morphable model. In these techniques, a 3D facial shape model is deformed to fit the input data. These trackers rely on either texture or depth, and either do not take advantage of both sources of information or use them only sparsely. In addition, sophisticated trackers use specially designed 3D face models which are not freely available. Lastly, they often require prior training or manual initial alignment of the face model performed by human operators.

    In this work, we propose a hybrid on-line 3D face tracker that takes advantage of both texture and depth information and is capable of tracking 3D head pose and facial actions simultaneously. First, we employ a generic deformable model, Candide-3, in our Iterative Closest Point (ICP) fitting framework. Second, we introduce a strategy to automatically initialize the tracker using the depth information. Lastly, we propose a hybrid tracking framework that combines ICP and an On-line Appearance Model (OAM) to utilize the strengths of both techniques. The ICP algorithm, which is aided by optical flow to correctly follow large head movements, robustly tracks the head pose across frames using depth information and provides a good initialization for OAM. In return, the OAM algorithm maintains the texture model of the face, corrects any drift incurred by ICP and transforms the 3D shape closer to the correct deformation, which then provides ICP with a good initialization in the next frame.

    2. Parameterized Face Model

    We use an off-the-shelf 3D deformable model, Candide-3, which was developed by J. Ahlberg [1]. The deformation of the face model is controlled by Shape Units (SUs), which represent face biometry specific to a person, and Action Units (AUs), which control facial expressions and are user-invariant. Since every vertex can be transformed independently, each vertex of the model is reshaped according to: \[g = p_0 + S\sigma + A\alpha \] where $p_0$ is the base coordinates of a vertex $p$, and $S$ and $A$ are the shape and action deformation matrices associated with vertex $p$, respectively. $\sigma$ is the vector of shape deformation parameters and $\alpha$ is the vector of action deformation parameters. In general, the transformation of a vertex given global motion, including rotation $R$ and translation $t$, is defined as: \[p' = R(p_0 + S\sigma + A\alpha ) + t \]
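The vertex transformation above is a single affine expression per vertex. The sketch below evaluates it with toy values; the numbers of shape and action parameters and all matrix entries are hypothetical and do not correspond to the actual Candide-3 model.

```python
import numpy as np

def transform_vertex(p0, S, A, sigma, alpha, R, t):
    """p' = R (p0 + S sigma + A alpha) + t for a single vertex."""
    return R @ (p0 + S @ sigma + A @ alpha) + t

p0 = np.array([1.0, 0.0, 0.0])                       # base vertex position
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])   # 3 x (number of shape units)
A = np.zeros((3, 1))                                 # 3 x (number of action units)
sigma = np.array([1.0, 0.0])                         # shape parameters
alpha = np.array([0.5])                              # action parameters
R = np.array([[0.0, -1.0, 0.0],                      # 90-degree rotation about z
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 5.0])                        # translation

p_new = transform_vertex(p0, S, A, sigma, alpha, R, t)
```

Here the shape unit moves the vertex to (1, 1, 0), the rotation maps it to (-1, 1, 0), and the translation lifts it to (-1, 1, 5).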

    We use the first frame to estimate the SU parameters corresponding to the test subject in neutral expression, together with the initial head pose. From the second frame onwards, we keep the shape unit parameters $\sigma$ unchanged and track the action unit parameters $\alpha$, along with the head pose $R$ and $t$. Seven action units are tracked in our framework, as depicted below.

    3. Initialization

    The initialization pipeline is described in the following figure:

    First, using a general 2D face alignment algorithm, we can reliably detect six feature points (eye and mouth corners), as shown below.

    These 2D points are back-projected to world coordinates to form a set of 3D correspondences using the depth map. Then, using the registration technique in [2], we recover the initial head pose. We use heuristics to guess the initial shape parameters by searching for facial parts (nose, chin). Lastly, we jointly optimize the pose and shape unit parameters by minimizing the following ICP energy:

    \[\hat{R}, \hat{t}, \hat{\sigma} = \mathop{\arg\min}\limits_{R,t,\sigma} \sum\limits_{i = 1}^N \left\| R(p_0 + S_i\sigma) + t - d_i \right\|^2\]

    The Levenberg-Marquardt algorithm is used to solve the above non-linear least squares problem [3].
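    For the initial head pose step, reference [2] describes a closed-form, SVD-based least-squares fit between two 3D point sets. The sketch below shows that idea on six toy correspondences. The point coordinates, the function name `rigid_fit` and the explicit reflection guard are assumptions made for illustration; this is not the tracker's actual code, and it does not cover the joint Levenberg-Marquardt refinement of pose and shape.

```python
import numpy as np

def rigid_fit(src, dst):
    """Least-squares R, t such that dst ~= src @ R.T + t (SVD-based fit)."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3 x 3 cross-covariance of the centred point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1) in the recovered matrix.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Six hypothetical model landmarks (eye and mouth corners), arbitrary units.
model = np.array([[-3.0,  1.0, 0.0],
                  [-1.0,  1.0, 0.5],
                  [ 1.0,  1.0, 0.5],
                  [ 3.0,  1.0, 0.0],
                  [-2.0, -3.0, 0.2],
                  [ 2.0, -3.0, 0.2]])

# Ground-truth pose used to synthesise the "observed" 3D points.
angle = 0.3
c, s = np.cos(angle), np.sin(angle)
R_true = np.array([[  c, 0.0,   s],
                   [0.0, 1.0, 0.0],
                   [ -s, 0.0,   c]])
t_true = np.array([0.1, -0.2, 1.5])
observed = model @ R_true.T + t_true

R_est, t_est = rigid_fit(model, observed)
```

    With noise-free correspondences the recovered pose matches the ground truth up to floating-point error; with real depth data it only serves as a starting point for the iterative optimization.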

    4. Tracking

    The overall tracking process is shown in the diagram below:

    The tracking process starts with minimizing the ICP energy to recover the head pose and action unit parameters. The procedure is similar to Algorithm 1, with only one change: in the first iteration, the correspondences are formed by optical flow tracking of the 2D-projected vertex features from the previous color frame to the current color frame. From the second iteration, correspondences are found by searching for closest points.
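    The closest-point search used from the second iteration can be sketched as a brute-force nearest-neighbour lookup. This is a minimal illustration with made-up points; the function name `closest_points` is hypothetical, and a practical tracker would more likely use a spatial index or projective lookup in the depth map.

```python
import numpy as np

def closest_points(model_pts, cloud_pts):
    """For each model vertex, return the nearest cloud point and its index."""
    # Pairwise squared distances, shape (num_model, num_cloud).
    diff = model_pts[:, None, :] - cloud_pts[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    idx = np.argmin(dist2, axis=1)
    return cloud_pts[idx], idx

# Toy data (assumed): two model vertices and three depth points.
model_pts = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0]])
cloud_pts = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.0, 0.0],
                      [5.0, 5.0, 5.0]])

matches, idx = closest_points(model_pts, cloud_pts)
```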

    Optical flow inherently introduces drift into tracking, and the error accumulated over time reduces the tracking performance. We therefore incorporate an On-line Appearance Model (OAM) as a refinement step in our tracker, which uses the full facial texture information while maintaining the no-training requirement.

    The On-line Appearance Model in our tracker is similar to that of [4], in which:
    - The appearance model is represented in a fixed-size template.
    - The mean appearance is built on-line for the current user after the first frame.
    - Each pixel in the template is modeled by an independent Gaussian distribution, so the appearance vector follows a multivariate Gaussian distribution that is updated over time:
    \[{\mu _{{i_{t + 1}}}} = \left( {1 - \alpha } \right){\mu _{{i_t}}} + \alpha {\chi _{{i_t}}} \]
    \[\sigma _{{i_{t + 1}}}^2 = \left( {1 - \alpha } \right)\sigma _{{i_t}}^2 + \alpha {\left( {{\chi _{{i_t}}} - {\mu _{{i_t}}}} \right)^2} \]

    The final transformation parameters are found by minimizing the Mahalanobis distance, where $u$ is the vector of $(R, t, \alpha)$ parameters:
    \[\hat{u}_t = \mathop{\arg\min}\limits_{u_t} \sum\limits_{i = 1}^n \left( \frac{\chi(u_t)_i - \mu_{i_t}}{\sigma_{i_t}} \right)^2 \]
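    The recursive template update and the variance-normalised distance can be sketched as follows. The pixel values are toy numbers, the function names are hypothetical, and the update weight is called `rate` here to avoid confusion with the action unit parameters, which share the symbol $\alpha$ in the text. The sketch only evaluates the distance for one candidate appearance; the search over $u$ is not shown.

```python
import numpy as np

def update_appearance(mu, var, chi, rate):
    """One recursive update of the per-pixel Gaussian template."""
    new_mu = (1.0 - rate) * mu + rate * chi
    new_var = (1.0 - rate) * var + rate * (chi - mu) ** 2
    return new_mu, new_var

def appearance_distance(chi, mu, var):
    """Sum over pixels of squared residuals normalised by the variance."""
    return float(np.sum((chi - mu) ** 2 / var))

# Toy three-pixel template (assumed values).
mu = np.array([0.5, 0.2, 0.8])       # per-pixel mean
var = np.array([0.01, 0.04, 0.01])   # per-pixel variance
chi = np.array([0.6, 0.2, 0.7])      # observed appearance for a candidate u

dist = appearance_distance(chi, mu, var)
mu_next, var_next = update_appearance(mu, var, chi, rate=0.1)
```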

    5. Experiments

    5.1. Synthetic Data

    Our single-threaded C++ implementation runs at up to 16 fps on a 2.3 GHz Intel Xeon CPU, which is unfortunately not fast enough to run on a live stream. We generate 446 synthetic RGBD sequences from the BU-4DFE dataset [5], in which the initial frames contain a neutral expression and white noise is applied to the depth maps. The size of the rendered face is about 200×250 pixels.

    We compare the results of our tracker to those of a pure ICP-based tracker whose resulting parameters are clamped within predefined boundaries to prevent drifting. The errors reported in Table 1 do not fully reflect the advantage of the hybrid tracker over the ICP tracker that is visible in the figure.

    5.2. Real RGB-D sequences

    We capture sequences from a Kinect and a Senz3D camera. In the Kinect sequence, the depth map is aligned to the color image, and our tracker performs well.

    In the sequence captured from the Senz3D camera, due to the disparity between the texture and the depth map resolutions, we map the texture to the depth map instead. The generated texture thus becomes very noisy, but the tracker still works reasonably well.


    • H.X. Pham and V. Pavlovic, “Hybrid On-line 3D Face and Facial Actions Tracking in RGBD Video Sequences” In: Proc. International Conference on Pattern Recognition (ICPR). (2014)


    • [1] J. Ahlberg, “An updated parameterized face,” Image Coding Group, Dept. of Electrical Engineering, Linkoping University, Tech. Rep.
    • [2] K. S. Arun, T. S. Huang, and S. D. Blostein, “Least-squares fitting of two 3d point sets,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, no. 5, pp. 698–700, 1987.
    • [3] A. W. Fitzgibbon, “Robust registration of 2d and 3d point sets,” Image and Vis. Comput., no. 21(13-14), pp. 1145–1153, 2003.
    • [4] F. Dornaika and J. Orozco, “Real-time 3d face and facial feature tracking,” J. Real-time Image Proc., pp. 35–44, 2007.
    • [5] L. Yin, X. Chen, Y. Sun, T. Worm and M. Reale, “A High-Resolution 3D Dynamic Expression Database”, in IEEE FG’08, 2008.

    Depth Recovery With Face Priors

    *Chongyu Chen was with Nanyang Technological University*

    1. Abstract.

    Existing depth recovery methods for commodity RGB-D sensors primarily rely on low-level information for repairing the measured depth estimates. However, as the distance of the scene from the camera increases, the recovered depth estimates become increasingly unreliable. The human face is often a primary subject in the captured RGB-D data in applications such as video conferencing. In this work we propose to incorporate face priors extracted from a general sparse 3D face model into the depth recovery process. In particular, we propose a joint optimization framework that consists of two main steps: deforming the face model for better alignment and applying face priors for improved depth recovery. The two steps are applied alternately and iteratively so that each improves the other. Evaluations on benchmark datasets demonstrate that the proposed method with face priors significantly outperforms the baseline method that does not use face priors, with up to 15.1% improvement in depth recovery quality and up to 22.3% in registration accuracy.

    2. The proposed method.

    Given a color image I and its corresponding (aligned) noisy depth map Z as input, our goal is to obtain a good depth map of the face region using the face priors derived from the general 3D deformable model. The pipeline of the proposed method is shown below.

    The first two components are pre-processing steps that roughly clean up the depth data and roughly align the general face model to the input point cloud. The last two components are the core of our proposed framework. In the guided depth recovery component, we fix the face prior and use it to update the depth, while in the last component, we fix the depth and update the face prior. The last two components operate alternately and iteratively until convergence.

    Based on [1], the energy function is formulated as
    \[\min\limits_{U, u} E_r(U) + \lambda_d E_d(U) + \lambda_f E_f(U, u)\]

    The first two terms are similar to [1]. The last term is the new face prior term, which is defined as follows:
    \[E_f(U, u) = \sum_{i \in \Omega_f} \eta_i \left( U(i) - T_f( P(u), i) \right)^2\]

    $U$ represents the depth map to be recovered, while $u$ represents the parameters of the 3D facial deformable model and $T_f$ is the facial shape transformation function according to $u$. For more details on the deformation of the face model, please refer to our previous project [2].

    Considering that the guidance from the sparse vertices of the Candide model may be too weak to serve as the prior for the full (dense) depth map U, we need to generate a dense synthetic depth map Y from the aligned face prior P(u) using an interpolation process. It is possible to define different interpolation functions according to desired dense surface properties. In computer graphics, such models may use non-uniform rational basis spline (NURBS) to guarantee surface smoothness. Here, for the purpose of a shape prior we choose a simple piece-wise linear interpolation. This process is denoted as
    \[Y = \text{lerp}( P(u) )\]
    and is illustrated in the figure.

    To mitigate the effects of the piece-wise flat dense patches due to the linear interpolation, we introduce a weighting scheme defined through weights $\eta_i$. In particular, for each pixel $Y(i)$, we use a normalized weight that adapts to the pixel's distances from the neighboring vertices of the sparse shape $P$. Let $(a_i, b_i, c_i)$ be the barycentric coordinates of pixel $i$ inside a triangle defined by its three neighboring vertices of $P$. Then, its weight is computed as
    \[\eta_i = \sqrt{a_i^2+b_i^2+c_i^2}, \ i \in \Omega_f.\]
    This means that the pixels corresponding to model vertices have the highest weight of $1$, while the weights decline towards the center of each triangular patch, where they reach their minimum of $1/\sqrt{3}$. An illustration of the weights is given in the figure below, where bright pixels represent large weights.
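    The weight formula can be checked numerically with a short sketch. The triangle and the function names `barycentric` and `prior_weight` are hypothetical and chosen only for illustration; the code works on 2D pixel coordinates of the projected triangle.

```python
import math

def barycentric(p, v0, v1, v2):
    """Barycentric coordinates (a, b, c) of 2D point p in triangle v0 v1 v2."""
    (x, y), (x0, y0), (x1, y1), (x2, y2) = p, v0, v1, v2
    det = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    a = ((y1 - y2) * (x - x2) + (x2 - x1) * (y - y2)) / det
    b = ((y2 - y0) * (x - x2) + (x0 - x2) * (y - y2)) / det
    return a, b, 1.0 - a - b

def prior_weight(p, v0, v1, v2):
    """Weight eta = sqrt(a^2 + b^2 + c^2) for a pixel inside the triangle."""
    a, b, c = barycentric(p, v0, v1, v2)
    return math.sqrt(a * a + b * b + c * c)

# Toy triangle in pixel coordinates (assumed values).
v0, v1, v2 = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)

w_vertex = prior_weight(v0, v0, v1, v2)                    # at a model vertex
w_centre = prior_weight((4.0 / 3.0, 4.0 / 3.0), v0, v1, v2)  # at the centroid
```

    In this toy case the weight is 1 at a vertex and drops to about 0.577 at the centroid, which matches the behavior described above.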

    3. Energy optimization.

    From the definition of the energy function, it can be seen that the optimization of U remains a convex task for a given fixed prior P. However, the optimization of the face model parameter set u might not be convex, since it involves rigid and non-rigid deformation. Therefore, to tackle the global optimization task, which includes the recovery of both the depth U and the deformation u, we resort to a standard alternating optimization process. In other words, we first optimize u while keeping U fixed, and then optimize U for the fixed deformation u. Specifically, we divide our problem into three well-studied subproblems: depth recovery, rigid registration, and non-rigid deformation. The algorithm is detailed below:

    4. Experiments.
    a. Synthetic data

    We first use the BU4D Facial Expression Database [3] for quantitative evaluation. Considering that Kinect is the most popular commodity RGB-D sensor, we add some Kinect-like artifacts to the depth maps generated from the BU4D database.

    We rendered the BU4DFE data at different distances: 1.2 m, 1.5 m, 1.75 m and 2 m. By using synthetic data, we are able to obtain the ground truth for quantitative evaluation. Specifically, we measure the depth recovery performance in terms of average pixel-wise Mean Absolute Error (MAE) in mm. The following plot compares the results of our method to the baseline [1].
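    The error metric can be sketched as below. The depth values and the face mask are toy data, and restricting the average to a face-region mask is an assumption made for illustration; the function name `depth_mae` is hypothetical.

```python
import numpy as np

def depth_mae(recovered, ground_truth, mask):
    """Mean absolute depth error (same unit as the inputs) over masked pixels."""
    return float(np.mean(np.abs(recovered[mask] - ground_truth[mask])))

# Toy 2 x 2 depth maps in mm (assumed values).
ground_truth = np.array([[1200.0, 1201.0],
                         [1202.0, 1203.0]])
recovered = np.array([[1201.0, 1199.0],
                      [1202.0, 1210.0]])
# Hypothetical face-region mask; the last pixel is excluded.
mask = np.array([[True, True],
                 [True, False]])

mae = depth_mae(recovered, ground_truth, mask)
```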

    Besides the recovery error, we also evaluate the registration accuracy. To get the reference registration and shapes, we fit the 3D face model to noise-free data. The face model is also fitted to the depth maps obtained by the different methods. We then compare each fitting result with the reference registration. Table 1 shows that the proposed method produces a more accurate face registration than the baseline method, especially in the eye region and around the face boundary.

    Samples at 1.75m

    Samples at 2m

    Blendshape-based 3D Face Tracking

    A result sample of our 3D face tracker, where (a) shows the 3D landmarks projected onto image plane, (b,c) show the 3D blendshape model and the input point cloud, and (d) shows the skinned 3D shape.

    1. Abstract.

    We introduce a novel robust hybrid 3D face tracking framework for RGBD video streams, which is capable of tracking head pose and facial actions without pre-calibration or intervention from a user. In particular, we focus on improving the tracking performance in instances where the tracked subject is at a large distance from the cameras and the quality of the point cloud deteriorates severely. This is accomplished by combining a flexible 3D shape regressor with a joint 2D+3D optimization of the shape parameters. Our approach fits facial blendshapes to the point cloud of the human head, while being driven by an efficient and rapid 3D shape regressor trained on generic RGB datasets. As an on-line tracking system, the identity of the unknown user is adapted on-the-fly, resulting in improved 3D model reconstruction and consequently better tracking performance. The result is a robust RGBD face tracker, capable of handling a wide range of target scene depths, beyond those that can be afforded by traditional depth or RGB face trackers. Lastly, since the blendshape is not able to accurately recover the real facial shape, we use the tracked 3D face model as a prior in a novel filtering process to further refine the depth map for use in other tasks, such as 3D reconstruction.

    2. The tracking framework.

    In this work, we use the blendshape model from the FaceWarehouse database.

    The figure shows the pipeline of the proposed face tracking framework, which follows a coarse-to-fine multi-stage optimization design. In particular, our framework consists of two major stages: shape regression and shape refinement. The shape regressor, which is learned from training data, performs the first optimization stage and quickly estimates the shape parameters from the RGB frame. Then, in the second stage, a carefully designed optimization is performed on both the 2D image and the available 3D point cloud data to refine the shape parameters, and finally the identity parameters are updated to improve shape fitting to the input RGBD data.

    The 3D shape regressor is the key component for achieving our goal of 3D tracking at large distances, where the quality of the depth map is often poor. Existing RGBD-based face tracking works either rely heavily on an accurate input point cloud (at close distances) to model shape transformation by ICP, or use an off-the-shelf 2D face tracker to guide the shape transformation. In contrast, we predict the 3D shape parameters directly from the RGB frame with the 3D regressor we developed. This is motivated by the success of 3D shape regression from RGB images. The approach is especially meaningful for the large-distance scenarios we consider, where the depth quality is poor. We therefore do not use the depth information in the 3D shape regression, to avoid propagating inaccuracies from the depth map.

    Initially, a color frame $I$ is passed through the regressor to recover the shape parameters $\theta$. The projection of the $N_l$ landmark vertices of the 3D shape onto the image plane typically does not accurately match the 2D landmarks annotated in the training data. We therefore include 2D displacements $D$ in the parameter set and define a new global shape parameter set $P = (\theta, D) = (R, T, e, D)$. The advantages of including $D$ in $P$ are two-fold. First, it helps train the regressor to reproduce landmarks in the test image similar to those in the training set. Second, it prepares the regressor to work with an unseen identity that does not appear in the training set; in such a case, the displacement error $D$ may be large to compensate for the difference in identities. The regression process can be expressed as \[P^{out} = {f_{r}}(I,P^{in}),\] where $f_r$ is the regression function, $I$ is the current frame, and $P^{in}$ and $P^{out}$ are the input (from the shape regression for the previous frame) and output shape parameter sets, respectively.

    The coarse estimates $P^{out}$ are refined further in the next stage, using a more precise energy optimization that incorporates depth information. Specifically, \[ \theta = (R,T,e)\] is optimized with respect to both the 2D prior constraints provided by the 2D landmarks estimated by the shape regressor and the 3D point cloud. Lastly, the identity vector $w_{id}$ is re-estimated given the current transformation. (For more details, please refer to our manuscript on arXiv.)

    The effect of using depth data for regularization: (a,b) without depth data; (c,d) with depth data

    Identity adaptation:

    3. Tracking results.

    – On BU4DFE dataset

    – On real RGBD sequences

    with occlusion:

    4. Depth recovery using dense shape priors.

    Building on our previous work, we replace the sparse Candide face model with a blendshape model and formulate the depth recovery process as a filter on the depth map.

    A sample result on real data at 2 m: (a) the prior, (b) the raw depth data, (c) filtered without prior, (d) filtered with prior.


    • H. X. Pham, C. Chen, L. N. Dao, V. Pavlovic, J. Cai and T.-J. Cham. “Robust Performance-driven 3D Face Tracking in Long Range Depth Scenes”. 2015