cs535 special permission or prerequisite override requests for Fall 2018

If you are interested in registering for my Fall 2018 cs535 Pattern Recognition course and do not have the prerequisites, or for some other reason need a Special Permission (SPN) or Prerequisite Override, you will need to fill out a request here:

cs535 Fall 2018 SPN & Prereq Request Form

Note that you will need a Google account in order to sign in and see the form.

Please do not email me with individual requests.  I will be issuing SPNs and Prerequisite Overrides no earlier than 2 weeks before the start of the Fall 2018 semester.

Welcome to new lab members

This fall we are welcoming two new students to our group: Mihee Lee and Yuting Wang. Mihee joins us from R&D at Samsung, where she worked after completing her BS in Math at Ewha Woman’s University, Korea. Yuting completed her MS at the Karlsruhe Institute of Technology in Germany and was also a visiting student at CMU.

Please join me in welcoming Mihee and  Yuting to Rutgers and Seqam Lab.

Distributed Probabilistic Learning

1. Abstract

Traditional computer vision algorithms, particularly those that exploit various probabilistic and learning-based approaches, are often formulated in centralized settings. However, modern computational settings are increasingly characterized by networks of peer-to-peer connected devices with local data processing abilities. A number of distributed algorithms have been proposed to address problems such as calibration, pose estimation, tracking, and object and activity recognition in large camera networks [1],[2].

One critical challenge in distributed data analysis includes dealing with missing data. In camera networks, different nodes will only have access to a partial set of data features because of varying camera views or object movement. For instance, object points used for SfM may be visible only in some cameras and only in particular object poses. As a consequence, different nodes will be frequently exposed to missing data. However, most current distributed data analysis methods are algebraic in nature and cannot seamlessly handle such missing data.

In this work we present an approach to estimation and learning of generative probabilistic models in a distributed context where certain sensor data can be missing. In particular, we show how traditional centralized models, such as probabilistic PCA (PPCA) [3], missing-data PPCA [4], and Bayesian PCA (BPCA) [4], can be learned when the data is distributed across a network of sensors. We demonstrate the utility of this approach on the problem of distributed affine structure from motion. Our experiments suggest that the accuracy of the learned probabilistic structure and motion models rivals that of traditional centralized factorization methods, while being able to handle challenging situations such as missing or noisy observations.

2. Distributed Probabilistic Learning

We propose a distributed consensus learning approach for parametric probabilistic models with latent variables that can effectively deal with missing data. The goal of the network of sensors is to learn a single consensus probabilistic model (e.g., 3D object structure) without ever resorting to a centralized data pooling and centralized computation. Let \( \mathbf{X} = \{ \mathbf{x}_{n} | \mathbf{x}_{n} \in \mathcal{R}^{D} \} \) be a set of iid multivariate data points with the corresponding latent variables  \( \mathbf{Z} = \{ \mathbf{z}_{n} | \mathbf{z}_{n} \in \mathcal{R}^{M} \} \), \(n = 1 … N\). Our model is a joint density defined on \( (\mathbf{x}_{n}, \mathbf{z}_{n}) \) with a global parameter \( \theta \),

\[
(\mathbf{x}_{n}, \mathbf{z}_{n}) \sim p(\mathbf{x}_{n}, \mathbf{z}_{n} | \theta),
\]

with \( p(\mathbf{X}, \mathbf{Z} | \theta) = \prod_n p(\mathbf{x}_{n}, \mathbf{z}_{n} | \theta) \), as depicted in Figure 1-a. In this general model, we can find an optimal global parameter \( \hat{\theta} \) (in a MAP sense) by applying standard EM learning. It is important to point out that each posterior density estimate at point \( n \) depends solely on the corresponding measurement \( \mathbf{x}_{n} \) and does not depend on any other \( \mathbf{x}_{k}, k \neq n \), hence is decentralized. To consider the distributed counterpart of this model, let \( G = (V, E) \) be an undirected connected graph with vertices \( i, j \in V \) and edges \( e_{ij} = (i, j) \in E \) connecting the two vertices. Each \( i \)-th node is directly connected with 1-hop neighbors in \( \mathcal{B}_{i} = \{ j; e_{ij} \in E \} \). Suppose the set of data samples at \( i \)-th node is \( \mathbf{X}_{i} = \{ \mathbf{x}_{in}; n = 1, … , N_{i} \} \), where \( \mathbf{x}_{in} \in \mathcal{R}^{D} \) is \( n \)-th measurement vector and \( N_{i} \) is the number of samples collected in \( i \)-th node. Likewise, we define the latent variable set for node \( i \) as \( \mathbf{Z}_{i} = \{ \mathbf{z}_{in}; n = 1, … , N_{i} \} \).

Learning the model parameter would be decentralized if each node had its own independent parameter \(\theta_i\). Still, the centralized model can be equivalently defined using the set of local parameters, with an additional constraint on their consensus, \( \theta_1 = \theta_2 = \cdots = \theta_{|V|} \). This is illustrated in Figure 1-b where the local node models are constrained using ties defined on the underlying graph. The simple consensus tying can be more conveniently defined using a set of auxiliary variables \( \rho_{ij} \), one for each edge \( e_{ij} \) (Figure 1-c). This now leads to the final distributed consensus learning formulation, similar to [5]:

\begin{align*} \label{dpm_opt1} \hat{\mathbf{\theta}} = \arg\min_{ \{ \theta_{i} : i \in V \} } & -\log p( \mathbf{X} | \mathbf{\theta}, G) \\ s.t. &\quad \theta_{i} = \rho_{ij}, \rho_{ij} = \theta_{j}, i \in V, j \in \mathcal{B}_{i}. \end{align*}

This is a constrained optimization task that can be solved in a principled manner using the Alternating Direction Method of Multipliers (ADMM) [6]. ADMM iteratively, in a block-coordinate fashion, solves \( \max_{\lambda} \min_{\theta} \mathcal{L}(\cdot) \) on the augmented Lagrangian

\begin{align*} \label{dpm_opt2} \mathcal{L}( \mathbf{\theta}, \rho, \lambda ) &= -\log p( \mathbf{X} | \theta_{1}, \theta_{2}, \dots , \theta_{|V|}, G) \\ &\quad + \sum_{i \in V} \sum_{j \in \mathcal{B}_{i}} \left\{ \lambda_{ij1}^{\text{T}} ( \theta_{i} - \rho_{ij} ) + \lambda_{ij2}^{\text{T}} ( \rho_{ij} - \theta_{j} ) \right\} \\ &\quad + \frac{ \eta }{ 2 } \sum_{i \in V} \sum_{j \in \mathcal{B}_{i}} \left\{ \| \theta_{i} - \rho_{ij} \|^{2} + \| \rho_{ij} - \theta_{j} \|^{2} \right\} \end{align*}

where \( \lambda_{ij1}, \lambda_{ij2}, i,j \in V \) are the Lagrange multipliers, \( \eta \) is a positive scalar parameter and \( ||\cdot|| \) is an induced norm. The last term (modulated by \( \eta \)) is not strictly necessary for consensus but introduces additional regularization.
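To make the message passing behind this consensus ADMM concrete, the following minimal Python sketch uses simple quadratic local objectives as stand-ins for each node’s negative log-likelihood term, together with the resulting closed-form local update; the ring network, step size \( \eta \) and variable names are illustrative assumptions, not the exact D-PPCA updates.

import numpy as np

# Minimal sketch of decentralized consensus ADMM for the formulation above,
# assuming quadratic local objectives f_i(theta) = 0.5 * ||theta - b_i||^2 as
# stand-ins for each node's negative log-likelihood. Network: ring of 4 nodes.
np.random.seed(0)
D = 3                                   # parameter dimension
b = np.random.randn(4, D)               # hypothetical local data summaries
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
eta = 1.0                               # augmented-Lagrangian weight

theta = np.zeros((4, D))                # local parameter copies theta_i
lam = np.zeros((4, D))                  # aggregated dual variables lambda_i

for it in range(100):
    theta_old = theta.copy()
    for i in range(4):
        # Closed-form minimizer of
        #   f_i(t) + lam_i . t + eta * sum_j ||t - (theta_i + theta_j)/2||^2
        avg = np.mean([(theta_old[i] + theta_old[j]) / 2 for j in neighbors[i]], axis=0)
        n_i = len(neighbors[i])
        theta[i] = (b[i] - lam[i] + 2 * eta * n_i * avg) / (1 + 2 * eta * n_i)
    for i in range(4):
        # Dual ascent on the consensus constraints theta_i = theta_j
        lam[i] += eta * sum(theta[i] - theta[j] for j in neighbors[i])

print(theta.round(3))                   # rows approach the centralized solution, mean(b)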

 

3. Distributed Probabilistic Principal Component Analysis (D-PPCA)

Distributed versions of PPCA and missing-data PPCA can be derived straightforwardly based on the model above. Detailed information, including the derivation of the iterative formulas for distributed EM [5], can be found in Yoon and Pavlovic (2012).

We tested D-PPCA using synthetic Gaussian data, including the case when some of the values are missing. As one can see in Figure 2, D-PPCA performed comparably to its centralized counterpart regardless of whether data were missing (either missing-at-random (MAR) or missing-not-at-random (MNAR)). We also report an empirical convergence analysis in the supplementary material.

We also applied D-PPCA to the problem of distributed affine SfM. We conducted experiments in both synthetic (multiple cameras observing a rotating cube) and real settings. For the real settings, we used videos from Caltech [7] and Johns Hopkins [8]. We simulated a multi-camera setting by sequentially dividing frames among the cameras, e.g., with 5 cameras and 30 frames in total, frames 1-6 are assigned to camera 1, frames 7-12 to camera 2, and so on. We compared the D-PPCA reconstructed structure with the centralized, SVD-based reconstructed structure using the subspace angle. Table 1 below shows the results we obtained on the Caltech turntable dataset. It clearly shows that D-PPCA rivals SVD-based methods even when some values in the observations are missing. For detailed and additional results and explanation, please refer to the attached manuscript and supplementary materials.
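The evaluation protocol above can be sketched in a few lines: frames are split sequentially across a hypothetical set of cameras, a centralized SVD (Tomasi-Kanade style) factorization provides the reference structure, and two structures are compared by their largest subspace angle (here via scipy.linalg.subspace_angles). All data, including the stand-in for the D-PPCA output, are synthetic placeholders.

import numpy as np
from scipy.linalg import subspace_angles

# Sketch of the evaluation protocol described above (all names illustrative):
# frames of a 2F x P measurement matrix are split sequentially across cameras,
# and a recovered structure is compared against the centralized SVD structure
# via the largest principal (subspace) angle.
F, P, n_cameras = 30, 50, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(2 * F, P))              # stand-in for tracked 2D points,
                                             # rows ordered (x_f, y_f) per frame

# Sequential frame split: frames 1-6 -> camera 1, 7-12 -> camera 2, ...
cam_of_frame = np.repeat(np.arange(n_cameras), F // n_cameras)
W_local = [W[np.repeat(cam_of_frame == c, 2)] for c in range(n_cameras)]
print([w.shape for w in W_local])            # each node only sees its 12 rows

# Centralized affine SfM baseline: rank-3 factorization of the centered matrix
W_c = W - W.mean(axis=1, keepdims=True)
_, _, Vt = np.linalg.svd(W_c, full_matrices=False)
S_svd = Vt[:3].T                             # P x 3 structure basis

S_dppca = S_svd + 0.01 * rng.normal(size=(P, 3))   # placeholder for D-PPCA output

# Largest subspace angle (degrees) between the two structure subspaces
print(np.degrees(subspace_angles(S_svd, S_dppca)).max())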


4. Distributed Bayesian Principal Component Analysis (D-BPCA)

We can also apply a similar framework to obtain a distributed extension of the mean field variational inference (MFVI) formulation. It is easy to show that the distributed counterpart is equivalent to the centralized MFVI optimization problem:

\begin{align*}\label{ooo} [\hat{\lambda}_Z, \hat{\lambda}_W] = &\underset{\lambda_{Z_i}, \lambda_{W_i}: i \in V}{\arg\min}\; - \mathbb{E}_Q\big[ \log P(X,Z,W|\Omega_z,\Omega_w) \big] + \mathbb{E}_Q[\log Q] \nonumber \\ &s.t. \;\; \lambda_{W_i} = \rho_{ij},\;\;\rho_{ij} = \lambda_{W_j},\;\; i\in V, j\in \mathcal{B}_i \end{align*}

where \( Z=\{z_n \in \mathbb{R}^{M}\}_{n=1}^{N} \) denotes a set of local latent variables, \( W \) denotes a global latent variable, and \( \Omega=[\Omega_z, \Omega_w] \) denotes a set of fixed parameters. The forms of \( Q(z_n;\lambda_{z_n}) \) and \( Q(W;\lambda_{W}) \) are set to be in the same exponential family as the corresponding complete conditional distributions (e.g., \( P(W|X,Z,\Omega_w) \)). Using a conjugate exponential family for the prior and likelihood distributions, each coordinate descent update in MFVI can be done in closed form. However, the penalty terms are quadratic in the norm difference \( (\lambda_{W_i} - \rho_{ij}) \), which may result in non-analytic updates for \( \{\lambda_{W_i}\}_{i=1}^{|V|} \).

To solve the above problem efficiently, we propose to use Bregman ADMM (B-ADMM) [9], which generalizes ADMM by replacing the quadratic penalty term with different Bregman divergences in order to exploit the structure of the problem. We use the log partition function of the global parameter as the Bregman function. With this choice, the B-ADMM updates have closed-form solutions. The figure below compares the performance on the distributed affine structure from motion experiment described above, using the centralized versions of SVD, PPCA and BPCA (with variational inference) and the proposed distributed versions of PPCA and BPCA, under varying noise levels.

Related Publications

  • S. Yoon and V. Pavlovic, “Decentralized Probabilistic Learning For Sensor Networks,” in IEEE Global Conference on Signal and Information Processing, 2016.
    [BibTeX]
    @InProceedings{yoon16gsip,
    author = {Sejong Yoon and Vladimir Pavlovic},
    title = {Decentralized Probabilistic Learning For Sensor Networks},
    booktitle = {IEEE Global Conference on Signal and Information Processing},
    year = {2016},
    month = dec,
    note = {50\% contribution.},
    date-added = {2016-09-11 21:34:27 +0000},
    date-modified = {2016-09-11 21:35:57 +0000},
    }

  • C. Song, S. Yoon, and V. Pavlovic, “Fast ADMM Algorithm for Distributed Optimization with Adaptive Penalty,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, 2016, pp. 753–759.
    [BibTeX] [Download PDF]
    @InProceedings{yoon16aaai,
    author = {Changkyu Song and Sejong Yoon and Vladimir Pavlovic},
    title = {Fast {ADMM} Algorithm for Distributed Optimization with Adaptive Penalty},
    booktitle = {Proceedings of the Thirtieth {AAAI} Conference on Artificial Intelligence},
    year = {2016},
    pages = {753--759},
    address = {Phoenix, Arizona, {USA.}},
    month = feb,
    note = {33\% contribution},
    bdsk-url-1 = {http://arxiv.org/abs/1506.08928},
    date-modified = {2016-09-11 21:16:22 +0000},
    url = {http://arxiv.org/abs/1506.08928},
    }

  • B. Babagholami, S. Yoon, and V. Pavlovic, “D-MFVI: Distributed Mean Field Variational Inference using Bregman ADMM,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, 2016, pp. 1582–158.
    [BibTeX] [Download PDF]
    @InProceedings{behnam16aaai,
    author = {Behnam Babagholami and Sejong Yoon and Vladimir Pavlovic},
    title = {{D-MFVI}: Distributed Mean Field Variational Inference using Bregman {ADMM}},
    booktitle = {Proceedings of the Thirtieth {AAAI} Conference on Artificial Intelligence},
    year = {2016},
    pages = {1582--158},
    address = {Phoenix, Arizona, {USA.}},
    month = feb,
    note = {33\% contribution},
    bdsk-url-1 = {http://arxiv.org/abs/1507.00824},
    date-modified = {2016-09-11 21:16:12 +0000},
    url = {http://arxiv.org/abs/1507.00824},
    }

  • S. Yoon and V. Pavlovic, “Distributed Probabilistic Learning for Camera Networks with Missing Data,” in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., 2012, p. 2933–2941.
    [BibTeX] [Download PDF]
    @InProceedings{yoon12nips,
    author = {Sejong Yoon and Vladimir Pavlovic},
    title = {Distributed Probabilistic Learning for Camera Networks with Missing Data},
    booktitle = {Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States.},
    year = {2012},
    pages = {2933--2941},
    note = {50\% contribution},
    bdsk-url-1 = {http://papers.nips.cc/paper/4629-distributed-probabilistic-learning-for-camera-networks-with-missing-data},
    url = {http://papers.nips.cc/paper/4629-distributed-probabilistic-learning-for-camera-networks-with-missing-data},
    }

References

  1. R. J. Radke. “A Survey of Distributed Computer Vision Algorithms”. H. Nakashima, H. Aghajan, and J. C. Augusto, eds. Springer Science+Business Media, LLC. 2010.
  2. R. Tron and R. Vidal. “Distributed Computer Vision Algorithms”, IEEE Signal Processing Magazine, Vol. 28. 2011, pp. 32-45.
  3. M. E. Tipping and C. M. Bishop. “Probabilistic Principal Component Analysis”, Journal of the Royal Statistical Society, Series B. 1999, pp. 611-622.
  4. A. Ilin and T. Raiko. “Practical Approaches to Principal Component Analysis in the Presence of Missing Values”, Journal of Machine Learning Research, Vol. 11. 2010, pp. 1957-2000.
  5. P. A. Forero, A. Cano and G. B. Giannakis. “Distributed clustering using wireless sensor networks”, IEEE Journal of Selected Topics in Signal Processing, Vol. 5, August, 2011, pp. 707-724.
  6. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, vol. 3, Now Publishers, 2011.
  7. P. Moreels and P. Perona. “Evaluation of Features Detectors and Descriptors based on 3D Objects”, International Journal of Computer Vision, Vol. 73. 2007, pp. 263-284.
  8. R. Tron and R. Vidal. “A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms“, IEEE International Conference on Computer Vision and Pattern Recognition. 2007.
  9. H. Wang and A. Banerjee, “Bregman Alternating Direction Method of Multipliers”, Advances in Neural Information Processing Systems 27, 2014.

Recognition of Ancient Roman Coins using Spatial Information

Recognition of Ancient Roman Coins

1. Problem Formulation

For a given Roman coin image, the goal is to recognize who is on the coin.


2. Motivation

Understanding the ancient Roman coins could serve as references to understand the Roman empire

  • The Roman coins are always connected to Roman historical events and Roman imperial propaganda
  • The Roman empire knew how to use coins effectively for political propaganda
  • The Roman coins were widely used to convey the achievements of Roman emperors to the public
  • The Roman coins served to spread messages of changing policies or merits throughout the empire
  • The Roman emperors could also show themselves to the entire empire by engraving their portraits on the coins
  • The Roman coins were the newspaper of the Roman empire

3. Practical Application

A reliable and automatic method to recognize the coins is necessary

  • The coin market is very active, as many people collect coins as a hobby. The coins were massively produced and new Roman coins are excavated daily, making them affordable to collect.
  • Ancient coins are becoming subject to a very large illicit trade. Recognition of ancient Roman coins is not easy for novices and requires expert knowledge.

4. Challenges

  • Inter-class similarity due to the engraver’s limited knowledge of the emperor’s portrait and the abstraction of the engraving
  • Intra-class dissimilarity: the coins were made manually in different factories
  • The recognition of the face on the coin is different from that of the real face

5. Coin Data Collection

  • Coin images are collected from a numismatic website [1, 2]
  • 2815 coin images with 15 Roman emperors
    – A small part of a much larger dataset
    – Annotated for visual analysis (the original dataset only has numismatic annotation)
    – Each emperor has at least 10 coin images
  • High resolution images: 350-by-350 pixels

6. Coin Recognition Methods using Spatial Information

  1. Deformable Part Model (DPM) based method
     – Encodes spatial information more precisely than a spatial pyramid, via alignment
     – DPM is used to align the coin image by locating the face of the emperor
     – Training and test of DPM
  2. Fisher Vector based method (a sketch follows this list)
     – Each point is represented as a combination of visual features and location, (x, l)
     – A Gaussian mixture model describes the probability of (x, l)
     \[ \begin{eqnarray} p(\mathbf{x}, \mathbf{l}) & = & \sum_k \pi_k \cdot p(\mathbf{x}, \mathbf{l}; {\Sigma}_k^V, {\Sigma}_k^L, {\mu}_k^V, {\mu}_k^L) \nonumber \\ & = & \sum_k \pi_k \cdot p(\mathbf{x}; {\Sigma}_k^V, {\mu}_k^V) \cdot p(\mathbf{l}; {\Sigma}_k^L, {\mu}_k^L), \end{eqnarray} \] where \(\pi_k\) is a prior probability for the \(k\)th component, \({\Sigma}_k^V, {\mu}_k^V\) are the covariance and mean for the visual descriptors, \({\Sigma}_k^L, {\mu}_k^L\) the covariance and mean for the location, and \[ \begin{eqnarray} p(\mathbf{x}; {\Sigma}_k^V, {\mu}_k^V) & \quad \sim \quad & \mathcal{N} (\mathbf{x}; {\Sigma}_k^V, {\mu}_k^V)\\ p(\mathbf{l}; {\Sigma}_k^L, {\mu}_k^L) & \quad \sim \quad & \mathcal{N} (\mathbf{l}; {\Sigma}_k^L, {\mu}_k^L). \end{eqnarray} \] The gradients with respect to the means and covariances define the Fisher vector.
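Below is a minimal sketch of the location-augmented Fisher vector idea, assuming a diagonal-covariance GMM over concatenated [descriptor, location] points (which factorizes across the visual and location blocks per component) and the common gradient formulas with respect to means and variances; the descriptor dimension, K and the normalization are illustrative choices rather than the paper’s exact settings.

import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal sketch of the location-augmented Fisher vector described above.
# A diagonal-covariance GMM over [descriptor, location] factorizes across the
# visual and location blocks, matching p(x, l) = p(x) p(l) per component.
rng = np.random.default_rng(0)
desc = rng.normal(size=(500, 64))            # stand-in for SIFT descriptors
loc = rng.uniform(size=(500, 2))             # normalized (x, y) locations
points = np.hstack([desc, loc])              # (x, l) pairs

K = 8
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0).fit(points)

def fisher_vector(feats, gmm):
    """Gradients of the log-likelihood w.r.t. GMM means and variances."""
    gamma = gmm.predict_proba(feats)                      # N x K responsibilities
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    N = feats.shape[0]
    diff = (feats[:, None, :] - mu[None]) / np.sqrt(var)[None]   # N x K x D
    g_mu = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(pi)[:, None])
    g_var = (gamma[:, :, None] * (diff**2 - 1)).sum(0) / (N * np.sqrt(2 * pi)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    # Power and L2 normalization, as is common for improved Fisher vectors
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

print(fisher_vector(points, gmm).shape)      # (2 * K * (64 + 2),)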

7. Experimental Results

  • Experimental settings (a sketch of this protocol follows the list below)
    – 2815 coin images with 15 emperors
    – For evaluation, divide the coin dataset into 5 splits, training on 4 splits and testing on 1 split
    – SIFT as the visual feature
    – Multi-class SVM for training and prediction

  • Recognition accuracies for various methods
  • Confusion matrices
  • Discriminative regions
  • Outlier detection
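The following is a sketch of the 5-fold evaluation protocol listed above, with random stand-ins for the per-image feature vectors and emperor labels; the actual features, SVM variant and hyperparameters used in the paper may differ.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Sketch of the 5-fold evaluation above. The feature vectors and labels are
# stand-ins; 2815 images and 15 emperors follow the description in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(2815, 1056))      # one feature vector per coin image (placeholder)
y = rng.integers(0, 15, size=2815)     # emperor labels (placeholder)

accs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = make_pipeline(StandardScaler(), LinearSVC())  # one-vs-rest multi-class SVM
    clf.fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))

print(f"mean recognition accuracy: {np.mean(accs):.3f}")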

8. Conclusion

We proposed two automatic methods to recognize ancient Roman coins. The first method employs the deformable part model to align the coin images and improve the recognition accuracy. The second method exploits the spatial information of the coin by directly encoding the location information. As the first method takes the face location into account, it performs slightly better than the second method. The experiments show that both methods outperform other methods such as the standard spatial pyramid model and a human face recognition method.

In this project, we collect a new ancient Roman coin dataset and investigate an automatic framework to recognize the coins, where we employ a state-of-the-art face recognition system and exploit the spatial information of the coin to improve the recognition accuracy. The coin images are high-resolution (350-by-350 pixels) and the face locations are annotated. While the proposed coin recognition framework is based on standard methods such as bag-of-words with spatial pyramids, Fisher vectors and DPM, we believe that their use in the context of ancient coin recognition represents an interesting contribution.

References

  • [1] J. Kim and V. Pavlovic. “Ancient Coin Recognition Based on Spatial Coding”. Proc. International Conference on Pattern Recognition (ICPR). 2014.
  • [2] J. Kim and V. Pavlovic. “Improving Ancient Roman Coin Recognition with Alignment and Spatial Encoding”. ECCV Workshop VISART. 2014.

Hybrid On-line 3D Face and Facial Actions Tracking in RGBD Video Sequences

1. Abstract

Tracking human faces has remained an active research area among the computer vision community for a long time due to its usefulness in a number of applications, such as video surveillance, expression analysis and human-computer interaction. An automatic vision-based tracking system is desirable and such a system should be capable of recovering the head pose and facial features, or facial actions. It is a non-trivial task because of the highly deformable nature of faces and their rich variability in appearances.

A popular approach for face modeling and alignment is to use statistical models such as Active Shape Models and Active Appearance Models. These techniques have been refined over a long period of time and proven to be robust. However, they were originally developed to work on 2D texture and require intensive preparation of training data. Using a 3D morphable model is another approach, in which a 3D facial shape model is deformed to fit the input data. These trackers rely on either texture or depth alone, not taking advantage of both sources of information, or using them only sparsely. In addition, sophisticated trackers use specially designed 3D face models which are not freely available. Lastly, they often require prior training or manual initial alignment of the face model performed by human operators.

In this work, we propose a hybrid on-line 3D face tracker that takes advantage of both texture and depth information and is capable of tracking 3D head pose and facial actions simultaneously. First, we employ a generic deformable model, Candide-3, in our ICP fitting framework. Second, we introduce a strategy to automatically initialize the tracker using the depth information. Lastly, we propose a hybrid tracking framework that combines ICP and OAM to utilize the strengths of both techniques. The ICP algorithm, aided by optical flow to correctly follow large head movements, robustly tracks the head pose across frames using depth information and provides a good initialization for OAM. In return, the OAM algorithm maintains the texture model of the face, corrects any drift incurred by ICP and moves the 3D shape closer to the correct deformation, which then provides ICP with a good initialization in the next frame.

2. Parameterized Face Model

We use an off-the-shelf 3D deformable model, Candide-3, which was developed by J. Ahlberg [1]. The deformation of the face model is controlled by Shape Units (SUs), which represent face biometry specific to a person, and Action Units (AUs), which control facial expressions and are user-invariant. Since every vertex can be transformed independently, each vertex of the model is reshaped according to: \[g = p_0 + S\sigma + A\alpha \] where \(p_0\) is the base coordinates of a vertex \(p\), S and A are the shape and action deformation matrices associated with vertex \(p\), respectively, \(\sigma\) is the vector of shape deformation parameters and \(\alpha\) is the vector of action deformation parameters. In general, the transformation of a vertex under global motion, including rotation \(R\) and translation \(t\), is defined as: \[p’ = R(p_0 + S\sigma + A\alpha ) + t \]

We use the first frame to estimate the SU parameters corresponding to the test subject in a neutral expression, together with the initial head pose. From the second frame onwards, we keep the shape unit parameters \(\sigma\) unchanged and track the action unit parameters \(\alpha\), along with the head pose \(R\) and \(t\). Seven action units are tracked in our framework, as depicted below.
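The per-vertex deformation and rigid transform above amount to a couple of matrix products; the sketch below uses random placeholder bases (not the real Candide-3 model data) simply to show the shapes involved.

import numpy as np

# Minimal sketch of the per-vertex Candide-style deformation above:
# p' = R (p0 + S*sigma + A*alpha) + t. The vertex count, 14 SUs and 7 AUs,
# and the random matrices are placeholders, not the actual Candide-3 data.
rng = np.random.default_rng(0)
N = 113                                   # number of mesh vertices (illustrative)
p0 = rng.normal(size=(N, 3))              # base vertex coordinates
S = rng.normal(size=(N, 3, 14))           # shape deformation basis per vertex
A = rng.normal(size=(N, 3, 7))            # action deformation basis per vertex

sigma = rng.normal(size=14) * 0.1         # shape unit parameters (person-specific)
alpha = np.zeros(7); alpha[0] = 0.5       # action unit parameters (expression)
R = np.eye(3)                             # head rotation
t = np.array([0.0, 0.0, 1.2])             # head translation

g = p0 + S @ sigma + A @ alpha            # deformed shape in the model frame
p_world = g @ R.T + t                     # transformed vertices p' = R g + t
print(p_world.shape)                      # (N, 3)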


3. Initialization

The initialization pipeline is described in the following figure:

First, using a general 2D face alignment algorithm, we can reliably detect six feature points (eye/mouth corners), as shown below.

These 2D points are back-projected to world coordinates using the depth map to form a set of 3D correspondences. Then, using the registration technique in [2], we recover the initial head pose. We use some heuristics to guess initial shape parameters by searching for facial parts (nose, chin). Lastly, we jointly optimize the pose and shape unit parameters by minimizing the following ICP energy:

\[
\hat{R}, \hat{t}, \hat{\sigma} = \mathop{\arg\min}\limits_{R,t,\sigma} \sum\limits_{i=1}^{N} \left\| R(p_{0} + S_{i}\sigma) + t - d_{i} \right\|^{2}
\]

The Levenberg-Marquardt algorithm is used to solve this non-linear least squares problem [3].
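As a rough illustration of the joint pose/shape fit, the sketch below minimizes the ICP energy above with SciPy’s Levenberg-Marquardt solver over a rotation vector, a translation and the shape-unit coefficients, using random placeholder correspondences; a real ICP loop would re-establish the correspondences d_i at every iteration.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

# Sketch of the joint pose/shape fit above using Levenberg-Marquardt
# (scipy's "lm" solver). Correspondences d_i and the shape basis are random
# placeholders just to exercise the optimization.
rng = np.random.default_rng(0)
N, n_su = 50, 14
p0 = rng.normal(size=(N, 3))
S = rng.normal(size=(N, 3, n_su))
d = rng.normal(size=(N, 3))               # closest-point correspondences

def residuals(x):
    rvec, t, sigma = x[:3], x[3:6], x[6:]
    R = Rotation.from_rotvec(rvec).as_matrix()
    model = (p0 + S @ sigma) @ R.T + t    # R(p0 + S_i sigma) + t
    return (model - d).ravel()            # stacked 3D residuals

x0 = np.zeros(6 + n_su)
sol = least_squares(residuals, x0, method="lm")
print("final cost:", sol.cost)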

4. Tracking

The overall tracking process is given in the below diagram:

The tracking process starts with minimizing the ICP energy to recover the head pose and action unit parameters. The procedure is similar to Algorithm 1, with only one change: in the first iteration, the correspondences are formed by optical flow tracking of the 2D-projected vertex features from the previous color frame to the current color frame. From the second iteration, correspondences are found by searching for closest points.

Optical flow inherently introduces drift into tracking, and the error accumulated over time will reduce tracking performance. Thus we incorporate an On-line Appearance Model (OAM) as a refinement step in our tracker, using the full facial texture information while maintaining the no-training requirement.

The On-line Appearance Model in our tracker is similar to that of [4], in which:
- The appearance model is represented in a fixed-size template.
- The mean appearance is built on-line for the current user after the first frame.
- Each pixel in the template is modeled by an independent Gaussian distribution, so the appearance vector is a multivariate Gaussian distribution that is updated over time:
\[{\mu _{{i_{t + 1}}}} = \left( {1 - \alpha } \right){\mu _{{i_t}}} + \alpha {\chi _{{i_t}}} \]
\[\sigma _{{i_{t + 1}}}^2 = \left( {1 - \alpha } \right)\sigma _{{i_t}}^2 + \alpha {\left( {{\chi _{{i_t}}} - {\mu _{{i_t}}}} \right)^2} \]

The final transformation parameters are found by minimizing the Mahalanobis distance, where u is the (R, t, α) parameter vector:
\[
\hat{u}_{t} = \mathop{\arg\min}\limits_{u_{t}} \sum\limits_{i=1}^{n} \left( \frac{ \chi(u_{t})_{i} - \mu_{i_{t}} }{ \sigma_{i_{t}} } \right)^{2}
\]
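A minimal sketch of the on-line appearance bookkeeping above: a per-pixel Gaussian template updated with forgetting factor α, together with the Mahalanobis-style cost used to score a candidate parameter vector u. The template size and α are illustrative.

import numpy as np

# Minimal sketch of the on-line appearance update above: each template pixel
# keeps a running Gaussian (mu, sigma^2) blended with the appearance chi
# extracted at the current frame.
alpha = 0.05
template_shape = (32, 32)

mu = np.zeros(template_shape)             # per-pixel appearance mean
var = np.ones(template_shape)             # per-pixel appearance variance

def update_appearance(chi, mu, var, alpha):
    """One OAM update step for the warped face appearance chi."""
    new_mu = (1 - alpha) * mu + alpha * chi
    new_var = (1 - alpha) * var + alpha * (chi - mu) ** 2
    return new_mu, new_var

def appearance_cost(chi, mu, var):
    """Mahalanobis-style cost used to score candidate parameters u."""
    return np.sum((chi - mu) ** 2 / var)

chi_t = np.random.default_rng(0).random(template_shape)   # stand-in observation
mu, var = update_appearance(chi_t, mu, var, alpha)
print(appearance_cost(chi_t, mu, var))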

5. Experiments

5.1. Synthetic Data

Our single-threaded C++ implementation runs at up to 16 fps on a 2.3 GHz Intel Xeon CPU, which unfortunately is not fast enough for live streams. We generated 446 synthetic RGBD sequences from the BU-4DFE dataset [5], where the initial frames contain a neutral expression, with white noise applied to the depth maps. The size of the rendered face is about 200×250 pixels.

We compare the results of our tracker to a pure ICP-based tracker whose resulting parameters are clamped within predefined boundaries to prevent drifting. The errors shown in Table 1 do not truly reflect the superior performance of the hybrid tracker over the ICP tracker as seen in the figure.


5.2. Real RGB-D sequences

We captured sequences with Kinect and Senz3D cameras. In the Kinect sequence, the depth map is aligned to the color image, and our tracker performs well.

In the sequence captured with the Senz3D camera, due to the disparity between the texture and depth map resolutions, we map the texture to the depth map instead; the generated texture thus becomes very noisy, but the tracker still works reasonably well.

Publication

  • H.X. Pham and V. Pavlovic, “Hybrid On-line 3D Face and Facial Actions Tracking in RGBD Video Sequences” In: Proc. International Conference on Pattern Recognition (ICPR). (2014)

References

  • [1] J. Ahlberg, “An updated parameterized face” Image Coding Group,Dept. of Electrical Engineering, Linkoping University, Tech. Rep.
  • [2] K. S. Arun, T. S. Huang, and S. D. Blostein, “Least-squares fitting of two 3d point sets,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, no. 5, pp. 698–700, 1987.
  • [3] A. W. Fitzgibbon, “Robust registration of 2d and 3d point sets,” Image and Vis. Comput., no. 21(13-14), pp. 1145–1153, 2003.
  • [4] F. Dornaika and J. Orozco, “Real-time 3d face and facial feature tracking,” J. Real-time Image Proc., pp. 35–44, 2007.
  • [5] L. Yin, X. Chen, Y. Sun, T. Worm and M. Reale, “A High-Resolution 3D Dynamic Expression Database”, in IEEE FG’08, 2008.


Depth Recovery With Face Priors

*Chongyu Chen was with Nanyang Technological University*

1. Abstract.

Existing depth recovery methods for commodity RGB-D sensors primarily rely on low-level information for repairing the measured depth estimates. However, as the distance of the scene from the camera increases, the recovered depth estimates become increasingly unreliable. The human face is often a primary subject in captured RGB-D data in applications such as video conferencing. In this work we propose to incorporate face priors extracted from a general sparse 3D face model into the depth recovery process. In particular, we propose a joint optimization framework that consists of two main steps: deforming the face model for better alignment and applying face priors for improved depth recovery. The two main steps are iterated alternately so as to help each other. Evaluations on benchmark datasets demonstrate that the proposed method with face priors significantly outperforms the baseline method that does not use face priors, with up to 15.1% improvement in depth recovery quality and up to 22.3% in registration accuracy.

2. The proposed method.

Given a color image I and its corresponding (aligned) noisy depth map Z as input, our goal is to obtain a good depth map of the face region using the face priors derived from the general 3D deformable model. The pipeline of the proposed method is shown below.

The first two components are pre-processing steps that roughly clean up the depth data and roughly align the general face model to the input point cloud. The last two components are the core of our proposed framework: in the guided depth recovery component, we fix the face prior and use it to update the depth, while in the last component, we fix the depth and update the face prior. The last two components operate alternately and iteratively until convergence.

Based on [1], the energy function is formulated as
\[\min\limits_{U, u} E_r(U) + \lambda_d E_d(U) + \lambda_f E_f(U, u)\]

The first two terms are similar to [1]. The last term is the new face prior term, which is defined as follows:
\[E_f(U, u) = \sum_{i \in \Omega_f} \eta_i \left( U(i) – T_f( P(u), i) \right)^2\]

Here U represents the depth map to be recovered, u represents the parameters of the 3D facial deformable model, and Tf is the facial shape transformation function according to u. For more details on the deformation of the face model, please refer to our previous project [2].


Considering that the guidance from the sparse vertices of the Candide model may be too weak to serve as the prior for the full (dense) depth map U, we need to generate a dense synthetic depth map Y from the aligned face prior P(u) using an interpolation process. It is possible to define different interpolation functions according to desired dense surface properties. In computer graphics, such models may use non-uniform rational basis spline (NURBS) to guarantee surface smoothness. Here, for the purpose of a shape prior we choose a simple piece-wise linear interpolation. This process is denoted as
\[Y = \text{lerp}( P(u) ),\]
which is demonstrated in the figure.

To mitigate the effects of the piece-wise flat dense patches due to the linear interpolation, we introduce a weighting scheme defined through weights ηi. In particular, for each pixel Y(i), we use a normalized weight that is adaptive to the pixel’s distances from the neighboring vertices of the sparse shape P. Let (ai,bi,ci) be the barycentric coordinates of pixel i inside a triangle defined by its three neighboring vertices of P. Then, its weight is computed as
\[\eta_i = \sqrt{a_i^2+b_i^2+c_i^2}, \ i \in \Omega_f.\]
This suggests that the pixels corresponding to model vertices have the highest weight of $1$ while the weights decline towards the center of each triangular patch. An illustration of the weights is given in the figure below, where bright pixels represent large weights.
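The weighting can be illustrated directly: given the barycentric coordinates of a pixel inside its enclosing triangle of projected model vertices, the weight is their Euclidean norm, which equals 1 at a vertex and decreases inside the patch. The triangle and test points below are made up.

import numpy as np

# Sketch of the adaptive weighting above: for each interpolated pixel, the
# barycentric coordinates (a, b, c) w.r.t. its enclosing triangle of sparse
# model vertices give the weight eta = sqrt(a^2 + b^2 + c^2).
tri = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])   # 2D projected vertices

def barycentric(p, tri):
    """Barycentric coordinates of 2D point p inside triangle tri (3x2)."""
    T = np.column_stack([tri[0] - tri[2], tri[1] - tri[2]])
    a, b = np.linalg.solve(T, p - tri[2])
    return np.array([a, b, 1.0 - a - b])

for p in [np.array([0.0, 0.0]), np.array([3.3, 3.3]), np.array([5.0, 5.0])]:
    abc = barycentric(p, tri)
    eta = np.sqrt(np.sum(abc ** 2))       # 1 at a vertex, lower inside the patch
    print(p, eta.round(3))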

3. Energy optimization.

From the definition of the energy function, the overall optimization of U remains a convex task for a given fixed prior P. However, the optimization of the face model parameter set u may not be convex since it involves rigid and non-rigid deformation. Therefore, to tackle the global optimization task, which includes recovery of both the depth U and the deformation u, we resort to a standard alternating optimization process: we first optimize u while keeping U fixed, and then optimize U for the fixed deformation u. Specifically, we divide our problem into three well-studied subproblems: depth recovery, rigid registration, and non-rigid deformation. The algorithm is detailed below:
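The sketch below shows this alternation schematically; the three solver calls are trivial placeholders for the depth recovery, rigid registration and non-rigid deformation subproblems, so only the control flow and stopping rule reflect the algorithm (it is not the paper’s Algorithm listing).

import numpy as np

# Schematic sketch of the alternating optimization described above. The three
# solver calls are placeholders; only the alternation and stopping rule matter.
def rigid_register(U, u):    return u                      # fix U, update u (rigid part)
def nonrigid_deform(U, u):   return u                      # fix U, update u (non-rigid part)
def recover_depth(Z, I, u):  return Z.copy()               # fix u, update U given prior P(u)
def energy(U, u, Z):         return float(np.mean((U - Z) ** 2))

def depth_with_face_prior(Z, I, u0, n_iters=10, tol=1e-6):
    U, u, prev = Z.copy(), u0, np.inf
    for _ in range(n_iters):
        u = rigid_register(U, u)
        u = nonrigid_deform(U, u)
        U = recover_depth(Z, I, u)
        e = energy(U, u, Z)
        if prev - e < tol:                # stop when the energy stops decreasing
            break
        prev = e
    return U, u

Z = np.random.default_rng(0).random((64, 64))   # noisy input depth (stand-in)
U, u = depth_with_face_prior(Z, I=None, u0=None)
print(U.shape)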

4. Experiments.
a. Synthetic data

We first use the BU4D Facial Expression Database [3] for quantitative evaluation. Considering that Kinect is the most popular commodity RGB-D sensor, we add some Kinect-like artifacts to the depth maps generated from the BU4D database.

We rendered the BU4DFE data at different distances: 1.2m, 1.5m, 1.75m and 2m. By using synthetic data, we are able to obtain the ground truth for quantitative evaluation. Specifically, we measure the depth recovery performance in terms of the average pixel-wise Mean Absolute Error (MAE) in mm. The following plot compares the results of our method to the baseline [1].


Besides the recovery error, we also evaluate the registration accuracy. To obtain the reference registration and shapes, we fit the 3D face model to the noise-free data. The face model is also fitted to the depth maps obtained by the different methods, and we compare the fitting results with the reference registration. Table 1 shows that the proposed method produces a more accurate face registration than the baseline method, especially in the eye region and around the face boundary.



Samples at 1.75m

Samples at 2m

Blendshape-based 3D Face Tracking

A result sample of our 3D face tracker, where (a) shows the 3D landmarks projected onto image plane, (b,c) show the 3D blendshape model and the input point cloud, and (d) shows the skinned 3D shape.

1. Abstract.

We introduce a novel robust hybrid 3D face tracking framework for RGBD video streams, capable of tracking head pose and facial actions without pre-calibration or intervention from the user. In particular, we emphasize improving the tracking performance in instances where the tracked subject is at a large distance from the camera and the quality of the point cloud deteriorates severely. This is accomplished by combining a flexible 3D shape regressor with a joint 2D+3D optimization of the shape parameters. Our approach fits facial blendshapes to the point cloud of the human head, while being driven by an efficient and rapid 3D shape regressor trained on generic RGB datasets. As an on-line tracking system, the identity of the unknown user is adapted on-the-fly, resulting in improved 3D model reconstruction and consequently better tracking performance. The result is a robust RGBD face tracker capable of handling a wide range of target scene depths, beyond those that can be afforded by traditional depth or RGB face trackers. Lastly, since the blendshape model is not able to accurately recover the real facial shape, we use the tracked 3D face model as a prior in a novel filtering process to further refine the depth map for use in other tasks, such as 3D reconstruction.

2. The tracking framework.

In this work, we use the blendshape model from the FaceWarehouse database.

The figure shows the pipeline of the proposed face tracking framework, which follows a coarse-to-fine multi-stage optimization design. In particular, our framework consists of two major stages: shape regression and shape refinement. The shape regressor performs the first optimization stage, which is learned from training data, to quickly estimate shape parameters from the RGB frame. Then, in the second stage, a carefully designed optimization is performed on both the 2D image and the available 3D point cloud data to refine the shape parameters, and finally the identity parameters are updated to improve shape fitting to the input RGBD data.

The 3D shape regressor is the key component for achieving our goal of 3D tracking at a large distance, where the quality of the depth map is often poor. Unlike existing RGBD-based face tracking works, which either rely heavily on an accurate input point cloud (at close distances) to model the shape transformation by ICP, or use an off-the-shelf 2D face tracker to guide the shape transformation, we predict the 3D shape parameters directly from the RGB frame with the developed 3D regressor. This is motivated by the success of 3D shape regression from RGB images. The approach is especially meaningful for the large-distance scenarios we consider, where the depth quality is poor. Thus, we do not use the depth information in the 3D shape regression, to avoid propagating inaccuracies from the depth map.

Initially, a color frame I is passed through the regressor to recover the shape parameters θ. The projection of the Nl landmark vertices of the 3D shape onto the image plane typically does not accurately match the 2D landmarks annotated in the training data. We therefore include 2D displacements D in the parameter set and define a new global shape parameter set P = ({θ},D) = (R,T,e,D). The advantages of including D in P are two-fold. First, it helps train the regressor to reproduce landmarks in the test image similar to those in the training set. Second, it prepares the regressor to work with unseen identities which do not appear in the training set; in such a case the displacement D may be large to compensate for the difference in identities. The regression process can be expressed as \(P^{out} = {f_{r}}(I,P^{in})\), where fr is the regression function, I is the current frame, and Pin and Pout are the input (from the shape regression on the previous frame) and output shape parameter sets, respectively. The coarse estimates Pout are refined further in the next stage, using a more precise energy optimization that adds the depth information. Specifically, \(\theta = (R,T,e)\) is optimized w.r.t. both the 2D prior constraints provided by the 2D landmarks estimated by the shape regressor and the 3D point cloud. Lastly, the identity vector wid is re-estimated given the current transformation. (For more details, please refer to our manuscript on arXiv.)

The effect of using depth data for regularization: (a,b) without depth data; (c,d) with depth data

Identity adaptation:

3. Tracking results.

– On BU4DFE dataset

– On real RGBD sequences

with occlusion:

4. Depth recovery using dense shape priors.

Based on our previous work, we replace the sparse Candide face model with the blendshape model and develop the depth recovery process as a filter on the depth map.

A result sample on real data at 2m: (a) the prior, (b) the raw depth data (c) filtered without prior (d) filtered with prior

Publication:

  • H. X. Pham, C. Chen, L. N. Dao, V. Pavlovic, J. Cai and T.-J. Cham. “Robust Performance-driven 3D Face Tracking in Long Range Depth Scenes”. 2015


Multi-Cue Structure Preserving MRF for Unconstrained Video Segmentation

Abstract

Video segmentation is a stepping stone to understanding video context. Video segmentation enables one to represent a video by decomposing it into coherent regions which comprise whole or parts of objects. However, the challenge originates from the fact that most video segmentation algorithms are based on unsupervised learning, due to the expensive cost of pixelwise video annotation and the intra-class variability within similar unconstrained video classes. We propose a Markov Random Field model for unconstrained video segmentation that relies on tight integration of multiple cues: vertices are defined from contour based superpixels, unary potentials from temporally smooth label likelihoods, and pairwise potentials from the global structure of a video. The multi-cue structure is key to extracting coherent object regions for unconstrained videos in the absence of supervision. Our experiments on the VSB100 dataset show that the proposed model significantly outperforms competing state-of-the-art algorithms. Qualitative analysis illustrates that the video segmentation result of the proposed model is consistent with human perception of objects.

1  Overview

Figure 1: Overview of the framework. (a) Node potential depends on histogram of temporal smooth pixelwise labels of the corresponding frame. Spatial edge potentials: (b) Gray intensity represents contour strength. (c) RGB color is displayed for better visualization. (d) Color represents motion direction. (e) Color represents visual word identity of each dense SIFT feature. Temporal edge potential depends on correspondence ratio on long trajectory and color affinity. (f) Superpixels for corresponding vertices in the frame f are illustrated by object contours. For visualization purpose, it shows coarse grained superpixels. Best viewed in color.

 

2  Contributions

In this paper, we propose a novel hierarchical video segmentation model which integrates temporally smooth labels and global structure consistency while preserving object boundaries. Our contributions are as follows:

•     We propose a video segmentation model that preserves multi-cue structures of object boundary and temporal smooth label with global spatio-temporal consistency.

•     We propose an effective pairwise potential to represent spatio-temporal structure evaluated on object boundary, color, optical flow, texture and long trajectory correspondence.

•     Video hierarchy is inferred through the process of graph edge consistency, which generalizes traditional hierarchy induction approaches.

•     The proposed method infers precise coarse grained segmentation, where a segment may represent one whole object.

 

3  Proposed Model

3.1  Multi-Cue Structure Preserving MRF Model

An overview of our framework for video segmentation is depicted in Figure 1. A video is represented as a graph G=(V,E), where the vertex set is defined on contour based superpixels from all frames f∈{1,⋯,F} in the video. For each frame, an object contour map is obtained from the contour detector [1]. A region enclosed by a contour forms a superpixel. The edge set describes the relationship for each pair of vertices and consists of spatial edges, which connect vertices within the same frame, and temporal edges, which connect vertices across different frames.

Video segmentation is obtained by MAP inference on a Markov Random Field defined on this graph G, with partition function Z. Each vertex i is assigned a label from the label set L of size L. MAP inference is equivalent to the following energy minimization problem:

\[ \min_{\mathbf{x}} \; \sum_{i \in V} \phi_{i} \cdot \mathbf{x}_{i} \;+\; \sum_{e_{ij} \in E} \psi_{ij} : \mathbf{x}_{i}\mathbf{x}_{j}^{\mathrm{T}} \qquad (1) \]

In (1), \(\phi_{i}\) represents the node potentials for a vertex \(i \in V\) and \(\psi_{ij}\) the edge potentials for an edge \(e_{ij} \in E\). As with the edge set E, edge potentials are decomposed into spatial and temporal edge potentials. The vector \(\mathbf{x}_{i}\) indicates the label of vertex i and \(\mathbf{x}_{i}\mathbf{x}_{j}^{\mathrm{T}}\) is the label pair indicator matrix for vertices i and j. Operators ⋅ and : represent the inner product and the Frobenius product, respectively. Spatial edge potentials are defined for each edge that connects vertices in the same frame f. In contrast, temporal edge potentials are defined for each pair of vertices in different frames. It is worth noting that the proposed model includes spatial edges between two vertices that are not spatially adjacent and, similarly, temporal edges are not limited to consecutive frames.

A set of vertices of the graph is defined from contour based superpixels such that the inferred region labels will preserve accurate object boundaries. Node potential parameters are obtained from temporally smooth label likelihoods. Edge potential parameters aggregate appearance and motion features to represent the global spatio-temporal structure of the video. MAP inference of the proposed Markov Random Field (MRF) model infers region labels which preserve object boundaries, attain temporal smoothness and are consistent with the global structure. Details are described in the following sections.

3.2  Node Potentials

Unary potential parameters represent the cost of labeling a vertex \(i \in V\) with a label from the label set L. While edge potentials represent the global spatio-temporal structure in a video, node potentials in the proposed model strengthen temporal smoothness for label inference. The temporally smooth label set L is obtained from a greedy agglomerative clustering [10]. The clustering algorithm merges two adjacent blobs in a video when their color difference is smaller than the variance of each blob. Node potential parameters represent the labeling cost of vertex i as the negative label likelihood.

Each superpixel is evaluated by pixelwise cluster labels from L, and the label histogram represents the label likelihood for vertex i. As illustrated in Figure 1 (a), a superpixel contains a mixture of pixelwise temporally smooth labels because the agglomerative clustering [10] merges unstructured blobs. The label histogram of vertex i counts the number of pixels with each temporally smooth label in the corresponding superpixel. As described in 3.1, a vertex is defined on a superpixel which is enclosed by an object contour. Arbelaez et al. [1] extract object contours such that taking different threshold values on the contours produces different granularity levels of enclosed regions. In our proposed model, we take the set of vertices of a video frame f from a single threshold on contours, which results in fine-grained superpixels.

3.3  Spatial Edge Potentials

Binary edge potential parameters ψ consist of two types: spatial and temporal edge potentials. Spatial edge potentials model the pairwise relationship of two vertices i and j within a single video frame f. We define these pairwise potentials as follows:

 

A spatial edge potential parameter is the (l,l‘) element of the matrix that represents the cost of labeling a pair of vertices i and j as l and l‘, respectively. It takes a Potts energy form, where all pairs of distinct labels take a homogeneous cost. Spatial edge potentials are decomposed into four terms, which represent pairwise potentials in the object boundary, color, optical flow direction and texture channels. The pairwise cost of having different labels is high if the two vertices i and j have high affinity in the corresponding channel. As a result, edge potentials increase the likelihood of assigning the same label to vertices i and j during energy minimization.

The edge potentials take equal weights on all channels. The importance of each channel may depend on video context, and different videos have dissimilar contexts. Learning weights for each channel is challenging and prone to overfitting due to the high variability of video context and the limited number of labeled video samples in the dataset. Hence, the proposed model weights all channels equally.

The model controls the granularity of segmentation by a threshold τ. In (9), the pairwise potential is thresholded by τ. If τ is set to a high value, only edges with higher affinity are included in the graph. On the other hand, if we set τ to a low value, the number of edges increases and more vertices are assigned to the same label because they are densely connected by the edge set. We next discuss each individual potential type in the context of our video segmentation model.
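As an illustration, the sketch below assembles one spatial Potts edge potential from several channel affinities and drops the edge when the combined affinity falls below τ; the RBF form and equal channel weights follow the text, but the exact expression in (9) may differ in detail.

import numpy as np

# Illustrative sketch of building a thresholded multi-channel Potts potential.
def rbf_affinity(dist, mean_dist):
    return np.exp(-dist / (mean_dist + 1e-12))

def spatial_edge_potential(dists, mean_dists, tau):
    """dists: per-channel distances (boundary, color, flow, texture) for one edge."""
    affinity = np.mean([rbf_affinity(d, m) for d, m in zip(dists, mean_dists)])
    if affinity < tau:
        return None                        # edge dropped from the graph
    L = 4                                  # number of labels (illustrative)
    psi = affinity * (1.0 - np.eye(L))     # Potts: cost only when l != l'
    return psi

print(spatial_edge_potential([0.2, 0.5, 0.1, 0.3], [0.4, 0.6, 0.2, 0.5], tau=0.3))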

Object Boundary Potentials. Object boundary potentials evaluate the cost of two vertices i and j in the same frame being assigned to different labels, in terms of object boundary information. The potential parameters are defined as follows:

where the argument is the minimum boundary path weight among all possible paths from vertex i to j, and the potentials are obtained from a Gaussian Radial Basis Function (RBF) of this path weight, with its mean as a normalization term.

If two superpixels i and j are adjacent, their object boundary potential is determined by the strength of the shared object contour along the edge connecting vertices i and j, where the boundary strength is estimated by the contour detector [1]. The boundary potentials can be extended to non-adjacent vertices i and j by evaluating a path weight from vertex i to j. For each path p from vertex i to j, the boundary potential of path p is evaluated by taking the maximum edge weight along the path. The algorithm to calculate the minimum such path weight is described in Algorithm 1, which modifies the Floyd-Warshall shortest path algorithm.

Typically, a path in a graph is evaluated by the sum of edge weights along the path. However, for boundary strength between two non-adjacent vertices, the total sum of edge weights along the path is not an effective measurement because the sum is biased toward the number of edges in the path. For example, a path consisting of edges with weak contour strength may have a higher path weight than another path consisting of a smaller number of edges with strong contours. Therefore, we evaluate a path by the maximum edge weight along it, so the path weight is governed by the edge with the strongest contour strength.
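Below is a minimal sketch of this minimax path computation, written as the Floyd-Warshall variant mentioned above (the sum along a path replaced by a max, minimized over paths); the small contour-strength matrix is made up.

import numpy as np

# Minimal sketch of the minimax ("max edge weight") path computation described
# above, as a Floyd-Warshall variant.
INF = np.inf
W = np.array([                 # W[i, j]: contour strength on edge (i, j)
    [0.0, 0.1, INF, 0.9],
    [0.1, 0.0, 0.2, INF],
    [INF, 0.2, 0.0, 0.8],
    [0.9, INF, 0.8, 0.0],
])

def minimax_path_weights(W):
    """D[i, j] = min over paths i->j of the maximum edge weight on the path."""
    D = W.copy()
    n = len(W)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # best path through k is capped by its strongest edge
                D[i, j] = min(D[i, j], max(D[i, k], D[k, j]))
    return D

print(minimax_path_weights(W))   # e.g. D[0, 3] = 0.8 via 0-1-2-3, not 0.9 direct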

Figure 2 illustrates the two path weight models: the max edge weight and the sum of edge weights. Figure 2 (a) shows contour strength, where red represents high strength. Two vertices, indicated by white arrows, are selected on an airplane. In Figure 2 (b), two paths are displayed. Path 2 consists of fewer edges, but it intersects a strong contour that represents the boundary of the airplane. If we evaluate the object boundary score between the two vertices, Path 1 should be preferred since it connects vertices within the airplane. Figure 2 (c) shows the edge-sum path weight from the tail vertex to all other vertices; the minimum path weight between the two vertices is attained on Path 2. On the other hand, Figure 2 (d) shows that the max-edge path weight selects Path 1 as the minimum weight path, which matches human perception of the object.

 

Figure 2: Comparison of two types of path weight models.

Color Potentials. The color feature for each vertex is represented by a histogram in CIELab color space over the corresponding superpixel. The color potential between vertices i and j is evaluated on their two color histograms:

where the distance in the exponent is the Earth Mover’s Distance (EMD) between the color histograms of vertices i and j, and the normalization parameter is its mean.

Earth Mover’s Distance [16] is a distance measurement between two probability distributions. EMD is typically more accurate than bin-to-bin distances in the color space of superpixels. An issue with bin-to-bin distances is that if two histograms on the simplex do not share any non-zero color bins, they are evaluated at the maximum distance of 1. Therefore, the distance between vertices i and j would be the same as the distance between i and k if i, j, k share no color bins. This occurs often when comparing color features of superpixels, because a superpixel is intended to exhibit coherent color, especially at the fine-grained level, and superpixels on different objects or different parts of an object may have different colors. For example, with a bin-to-bin distance, the distance between red and orange superpixels is the same as that between red and blue superpixels because their histograms share no color bins, which is not intuitive to human perception. In contrast, EMD considers the distance between color bins and is therefore able to distinguish non-overlapping color histograms.

Optical Flow Direction Potentials. In each video frame, the motion direction feature of the ith vertex is obtained from a histogram of optical flow directions. As with the color potentials, we use the EMD between the two histograms to accurately estimate the difference in motion direction:

where the normalization parameter is the mean EMD distance on the optical flow histograms.

Texture Potentials. Dense SIFT features are extracted for each superpixel and a Bag-of-Words (BoW) model is obtained from K-means clustering on the D-SIFT features. We evaluate SIFT features on multiple dictionaries of different K. Texture potentials are calculated from an RBF on the distance between the two BoW histograms, which is a typical choice of distance measurement for BoW models:

where the normalization parameter is the mean distance on the D-SIFT word histograms.

3.4  Temporal Edge Potentials

Temporal edge potentials define the correspondence of vertices at different frames. They rely on long trajectories, which convey long-range temporal dependencies and are more robust than optical flow.

where the potential depends on the set of long trajectories that pass through vertex i. The pairwise potential represents the temporal correspondence of two vertices via the overlap ratio of the long trajectories that vertices i and j share, where i and j lie in different frames f ≠ f‘. In order to distinguish two different objects with the same motion, we also integrate color potentials between the two vertices. Long trajectories are extracted from [18].

3.5  Hierarchical Inference on Segmentation Labels

The proposed model attains hierarchical inference of segmentation labels by controlling the number of edges with a fixed set of vertices defined at a finest level of superpixels. As the edge set becomes dense in the graph, the energy function in (1) takes higher penalties from the pairwise potentials. As a consequence, vertices connected by dense edges will be assigned to the same label and it leads to coarse-grained segmentation.

In contrast, another approach that enables hierarchical segmentation is to define a hierarchical vertex set in a graph. A set of vertices in the finer level will be connected to a vertex in coarser level. It introduces another set of edges which connect vertices at different levels of hierarchy.

Our proposed approach to hierarchical inference has computational advantages over a graph representation with a hierarchical vertex set. Our graph has fewer vertices and edges because we keep a single, finest level of the hierarchy without additional vertices for coarser levels. This not only enables efficient graph inference, but also avoids computing node and edge potentials for additional vertex and edge sets.

4  Experimental Evaluation

4.1  Dataset

We evaluate the proposed model on the VSB100 video segmentation benchmark provided by Galasso et al. [9]. There are a few additional video datasets with pixelwise annotation: the FBMS-59 dataset [15] consists of 59 video sequences and the SegTrack v2 dataset [13] consists of 14 sequences. However, both datasets annotate only a few major objects, leaving the whole background area as one label, which makes them more appropriate for object tracking or background subtraction tasks. In contrast, VSB100 consists of 60 test video sequences of up to 121 frames. For each video, every 20th frame is annotated with pixelwise segmentation labels by four annotators. The dataset contains the largest number of video sequences annotated with pixelwise labels, which allows quantitative analysis. The dataset provides a set of evaluation measurements.

Volume Precision-Recall. The VPR score measures the volumetric overlap between the segmentation result S of the proposed algorithm and the ground truths annotated by M annotators. Over-segmentation yields high precision with a low recall score.

Boundary Precision-Recall. The BPR score measures the overlap between the object boundaries of the segmentation result S and the ground-truth boundaries. Conversely to VPR, over-segmentation yields low precision with high recall scores.

4.2  MSP-MRF Setup

In this section, we present the detailed setup of our Multi-Cue Structure Preserving Markov Random Field (MSP-MRF) model for the unconstrained video segmentation problem. As described in Section 3.2, we take a single threshold on the image contour so that each frame contains approximately 100 superpixels. We assume this granularity is fine enough that no superpixel at this level overlaps multiple ground-truth regions. The node potential (6) is evaluated for each superpixel with the temporally smooth label obtained from agglomerative clustering [10]. Although we chose the 11th fine-grained level of the hierarchy, Section 4.4 shows that the proposed method performs stably over different label set sizes |L| for the node potential. Finally, the edge potentials are estimated as in (9) and (14). For color histograms, we used 50 bins per CIELab color channel; likewise, 50 bins were used for the horizontal and vertical components of optical flow. For the D-SIFT Bag-of-Words model, we used 5 dictionaries with K = 100, 200, 400, 800, 1000 words. The energy minimization problem (1) for MRF inference is solved with the FastPD algorithm [12].
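For reference, the hyperparameters listed above can be collected in a single configuration; the sketch below is a plain summary of the reported values (the key names are illustrative, not taken from the authors' code).

```python
# Summary of the MSP-MRF setup reported in Section 4.2.
MSP_MRF_CONFIG = {
    "superpixels_per_frame": 100,        # single contour threshold
    "node_label_hierarchy_level": 11,    # level of [10] used for |L|
    "color_bins_per_CIELab_channel": 50,
    "flow_bins_per_component": 50,       # horizontal and vertical motion
    "bow_dictionary_sizes": [100, 200, 400, 800, 1000],
    "mrf_solver": "FastPD",              # energy minimization of (1), [12]
}
```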

 

 

Figure 3: Temporal consistency recovered by MSP-MRF.

 

Figure 4: Comparison of segmentation boundaries at the same granularity level on two videos.

4.3  Qualitative Analysis

Figure 3 illustrates a segmentation result on an airplane video sequence. MSP-MRF rectifies the temporally inconsistent segmentation of [10]. For example, in the fourth column of Figure 3, the red bounding boxes show where MSP-MRF corrected labels from Grundmann’s result so that labels become spatio-temporally consistent across frames.

In addition, the control parameter τ successfully produces different granularity levels of segmentation: for MSP-MRF, the number of region labels decreases as τ decreases. Figure 4 compares the video segmentation results of MSP-MRF with Grundmann’s by displaying segmentation boundaries at the same granularity level, i.e., where the two methods produce the same number of segments in the video. MSP-MRF infers spatially smooth object regions, which shows that the proposed model successfully captures the spatial structure of objects.

 

Figure 5: PR curve comparison to other models.

 

Figure 6: PR curves for different sizes of the label set L.

Table 1: Performance of MSP-MRF model compared with state-of-the-art video segmentation algorithms on VSB100.

4.4  PR Curve on High Recall Regions

We specifically consider the high recall regions of the segmentation PR curves, since we are typically interested in videos with relatively few objects. Our proposed method improves and rectifies the state-of-the-art video segmentation produced by greedy agglomerative clustering [10], because we exploit structural information from object boundaries, color, optical flow, texture, and temporal correspondences from long trajectories. Figure 5 shows that the proposed method achieves a significant improvement over state-of-the-art algorithms. MSP-MRF improves both BPR and VPR scores to the point that it is close to the Oracle, which evaluates contour-based superpixels on the ground truth. Note that the Oracle is the best accuracy MSP-MRF could possibly achieve, because MSP-MRF also takes its contour-based superpixels from [1].

The proposed MSP-MRF model rectifies agglomerative clustering by merging two different vertex labels whenever doing so reduces the overall cost defined in (1). Increasing the number of edges in the graph by lowering the threshold τ leads to coarser-grained segmentation. As a result, MSP-MRF covers only the higher recall regions relative to the precision-recall scores of the selected label set size |L| from [10]. A hybrid model that also covers the high precision regions is described in Section 4.5.

Figure 6 illustrates the PR curves of MSP-MRF for different granularity levels of the label set |L| in the node potential (6). The dashed green line is the result of greedy agglomerative clustering [10]; the solid green line is the result of MSP-MRF with the edge threshold τ set to 1, which leaves no edges in the graph. The figure shows that the results of MSP-MRF are stable over different sizes of |L|, particularly in the high recall regions.

4.5  Hybrid Model for Over-Segmentation

The proposed model effectively merges the labels of node pairs according to the edge set E. As the number of edges increases, the size of the inferred label set shrinks below |L|, which covers the higher recall regions. Although we are primarily interested in high recall regions, the model must also be evaluated on the high precision regions of the PR curve. For this purpose, we adopt a hybrid model that uses the rectified segmentation results from MSP-MRF in the high recall regions but retains the segmentation results of [10] in the high precision regions as an unrectified baseline.
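A minimal sketch of this hybrid rule, assuming each method is represented by a list of (recall, precision) operating points, one per granularity level; the cutoff that decides where the baseline of [10] is retained is an assumption for illustration.

```python
def hybrid_pr_curve(msp_mrf_points, baseline_points):
    # msp_mrf_points / baseline_points: lists of (recall, precision)
    # operating points, one per granularity level. Use MSP-MRF where it
    # operates (high recall) and keep the unrectified baseline of [10]
    # elsewhere; the recall cutoff below is an illustrative choice.
    min_recall_covered = min(r for r, _ in msp_mrf_points)
    kept_baseline = [(r, p) for r, p in baseline_points
                     if r < min_recall_covered]
    return sorted(kept_baseline + msp_mrf_points)
```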

Table 1 compares performance with state-of-the-art video segmentation algorithms. The proposed MSP-MRF model outperforms them on most of the evaluation metrics. BPR and VPR are described in Section 4.1. The optimal dataset scale (ODS) aggregates F-scores at a single fixed scale of the PR curve across all video sequences, while the optimal segmentation scale (OSS) selects the best F-score at a different scale for each video sequence. All evaluation metrics follow the benchmark protocol of [9]. It is worth noting that our MSP-MRF model achieves the best ODS and OSS results for both the BPR and VPR measurements, matching the results of the Oracle. As described in Section 4.4, the Oracle evaluates contour-based superpixels on the ground truth.
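A minimal sketch of the ODS/OSS selection logic, assuming per-video precision and recall are available at each scale; the exact aggregation follows the benchmark protocol of [9] and may differ from this simplified version.

```python
import numpy as np

def f_score(precision, recall, eps=1e-12):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall + eps)

def ods_oss(per_video_pr):
    # per_video_pr: array of shape (n_videos, n_scales, 2) holding
    # (precision, recall) per video and per scale. ODS picks the single
    # scale that maximizes the mean F-score over all videos; OSS picks
    # the best scale per video and averages.
    per_video_pr = np.asarray(per_video_pr)
    f = f_score(per_video_pr[..., 0], per_video_pr[..., 1])  # (videos, scales)
    ods = f.mean(axis=0).max()   # best single scale across the dataset
    oss = f.max(axis=1).mean()   # best scale chosen per video
    return ods, oss
```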

MSP-MRF infers segmentation labels by integrating object boundaries, global structure, and temporal smoothness on top of [10]. The results show that incorporating boundary and global structure rectifies [10] by a significant margin. Note that the result of [10] is higher than previously reported in [9]; we attribute this to implementation updates to [10] over recent years. Qualitatively, we observe that the recent implementation of [10] detects objects whose appearance is less distinctive from the background, in circumstances where the previous implementation could not delineate them.

5  Conclusion

In this paper, we have presented a novel video segmentation model that addresses three important aspects of video segmentation. The model preserves object boundaries by defining the vertex set from contour-based superpixels. Temporally smooth labels are inferred by supplying the unary node potential with label likelihoods from agglomerative clustering. Finally, global structure is enforced through pairwise edge potentials on object boundary, color, optical flow, texture, and long-trajectory affinities. Experimental evaluation shows that the proposed model outperforms state-of-the-art video segmentation algorithms on most of the metrics.

 

References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(5):898–916, May 2011.

[2] V. Badrinarayanan, F. Galasso, and R. Cipolla. Label propagation in video sequences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[3] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2008.

[4] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(3):500–513, 2011.

[5] A. Elqursh and A. M. Elgammal. Online motion segmentation using dynamic label propagation. In IEEE International Conference on Computer Vision (ICCV), pages 2008–2015, 2013.

[6] B. Fröhlich, E. Rodner, M. Kemmler, and J. Denzler. Large-scale gaussian process multi-class classification for semantic segmentation and facade recognition. Machine Vision and Applications, 24(5):1043–1053, 2013.

[7] F. Galasso, R. Cipolla, and B. Schiele. Video segmentation with superpixels. In Asian Conference on Computer Vision (ACCV), 2012.

[8] F. Galasso, M. Keuper, T. Brox, and B. Schiele. Spectral graph reduction for efficient image and streaming video segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[9] F. Galasso, N. S. Nagaraja, T. J. Cardenas, T. Brox, and B. Schiele. A unified video segmentation benchmark: Annotation, metrics and analysis. In IEEE International Conference on Computer Vision (ICCV), December 2013.

[10] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph based video segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[11] A. Khoreva, F. Galasso, M. Hein, and B. Schiele. Learning must-link constraints for video segmentation based on spectral clustering. In German Conference on Pattern Recognition (GCPR), 2014.

[12] N. Komodakis and G. Tziritas. Approximate labeling via graph cuts based on linear programming. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(8):1436–1453, Aug. 2007.

[13] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In IEEE International Conference on Computer Vision (ICCV), 2013.

[14] B. Nadler and M. Galun. Fundamental limitations of spectral clustering methods. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2007. MIT Press.

[15] P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(6):1187–1200, June 2014.

[16] O. Pele and M. Werman. Fast and robust earth mover’s distances. In IEEE International Conference on Computer Vision (ICCV), 2009.

[17] P. Ochs and T. Brox. Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions. In IEEE International Conference on Computer Vision (ICCV), 2011.

[18] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science. Springer, Sept. 2010.

[19] C. Zhang, L. Wang, and R. Yang. Semantic segmentation of urban scenes using dense depth maps. In European Conference on Computer Vision (ECCV), pages 708–721, Berlin, Heidelberg, 2010. Springer-Verlag.