*Chongyu Chen was with Nanyang Technological University*
1. Abstract.
Existing depth recovery methods for commodity RGB-D sensors primarily rely on low-level information for repairing the measured depth estimates. However, as the distance of the scene from the camera increases, the recovered depth estimates become increasingly unreliable. The human face is often a primary subject in the captured RGB-D data in applications such as video conferencing. In this work we propose to incorporate face priors extracted from a general sparse 3D face model into the depth recovery process. In particular, we propose a joint optimization framework that consists of two main steps: deforming the face model for better alignment and applying face priors for improved depth recovery. The two steps are applied alternately and iteratively so that each improves the other. Evaluations on benchmark datasets demonstrate that the proposed method with face priors significantly outperforms the baseline method that does not use face priors, with up to 15.1% improvement in depth recovery quality and up to 22.3% in registration accuracy.
2. The proposed method.
Given a color image I and its corresponding (aligned) noisy depth map Z as input, our goal is to obtain a good depth map of the face region using the face priors derived from the general 3D deformable model. The pipeline of the proposed method is shown below.
The first two components are pre-processing steps that roughly clean up the depth data and roughly align the general face model to the input point cloud. The last two components are the core of our proposed framework. In the guided depth recovery component, we fix the face prior and use it to update the depth; in the last component, we fix the depth and update the face prior. The last two components operate alternately and iteratively until convergence.
Based on [1], the energy function is formulated as
\[\min\limits_{U, u} E_r(U) + \lambda_d E_d(U) + \lambda_f E_f(U, u)\]
The first two terms are similar to [1]. The last term is the new face prior term, which is defined as follows:
\[E_f(U, u) = \sum_{i \in \Omega_f} \eta_i \left( U(i) - T_f( P(u), i) \right)^2\]
U represents the depth map to be recovered, while u represents the parameters of the 3D facial deformable model and Tf is the facial shape transformation function according to u. For more details on the deformation of the face model, please refer to our previous project [2].
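As a minimal sketch of how the face prior term could be evaluated, the snippet below sums the weighted squared differences over the face region. It assumes the prior depth at pixel i, written T_f(P(u), i) above, has already been rendered into an array `Y`; the names `face_prior_energy`, `Y`, `eta` and `omega_f` are illustrative and do not come from the original work.

```python
def face_prior_energy(U, Y, eta, face_pixels):
    """Evaluate sum_{i in Omega_f} eta_i * (U(i) - Y(i))^2.

    U           -- current depth estimate, indexed by pixel
    Y           -- prior depth per pixel (assumed stand-in for T_f(P(u), i))
    eta         -- per-pixel weights eta_i
    face_pixels -- indices of the face region Omega_f
    """
    return sum(eta[i] * (U[i] - Y[i]) ** 2 for i in face_pixels)


# Toy data: four pixels, depths in millimetres (values are made up).
U = [1200.0, 1210.0, 1190.0, 1500.0]
Y = [1202.0, 1210.0, 1193.0, 0.0]   # pixel 3 lies outside the face
eta = [1.0, 0.8, 0.6, 0.0]
omega_f = [0, 1, 2]

energy = face_prior_energy(U, Y, eta, omega_f)
```

Only pixels in the face region contribute, so the background pixel is ignored regardless of its depth.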
Considering that the guidance from the sparse vertices of the Candide model may be too weak to serve as the prior for the full (dense) depth map U, we need to generate a dense synthetic depth map Y from the aligned face prior P(u) using an interpolation process. It is possible to define different interpolation functions according to desired dense surface properties. In computer graphics, such models may use non-uniform rational basis spline (NURBS) to guarantee surface smoothness. Here, for the purpose of a shape prior we choose a simple piece-wise linear interpolation. This process is denoted as
\[Y = \text{lerp}( P(u) )\]
This interpolation is illustrated in the figure.
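One common way to realise piece-wise linear interpolation over a triangulated sparse model is through barycentric coordinates: the depth at a pixel inside a triangle is the barycentric-weighted average of the three vertex depths. The sketch below shows this for a single triangle. The helper names `barycentric` and `lerp_depth` are hypothetical, and how the original work rasterises the whole mesh is not specified here.

```python
def barycentric(p, v0, v1, v2):
    """Barycentric coordinates (a, b, c) of 2D point p in triangle v0, v1, v2."""
    (x, y), (x0, y0), (x1, y1), (x2, y2) = p, v0, v1, v2
    det = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    a = ((y1 - y2) * (x - x2) + (x2 - x1) * (y - y2)) / det
    b = ((y2 - y0) * (x - x2) + (x0 - x2) * (y - y2)) / det
    return a, b, 1.0 - a - b


def lerp_depth(p, tri_xy, tri_z):
    """Linearly interpolated depth at pixel p, or None if p is outside the triangle."""
    a, b, c = barycentric(p, *tri_xy)
    if min(a, b, c) < -1e-9:
        return None
    z0, z1, z2 = tri_z
    return a * z0 + b * z1 + c * z2


# Toy triangle in pixel coordinates with vertex depths in millimetres.
tri_xy = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
tri_z = [1000.0, 1040.0, 1020.0]

inside = lerp_depth((1.0, 1.0), tri_xy, tri_z)
at_vertex = lerp_depth((0.0, 0.0), tri_xy, tri_z)
outside = lerp_depth((5.0, 5.0), tri_xy, tri_z)
```

Applying `lerp_depth` to every pixel covered by the projected mesh would yield a dense map in the spirit of Y, flat within each triangle's plane.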
To mitigate the effects of the piece-wise flat dense patches due to the linear interpolation, we introduce a weighting scheme defined through weights ηi. In particular, for each pixel Y(i), we use a normalized weight that is adaptive to the pixel’s distances from the neighboring vertices of the sparse shape P. Let (ai,bi,ci) be the barycentric coordinates of pixel i inside a triangle defined by its three neighboring vertices of P. Then, its weight is computed as
\[\eta_i = \sqrt{a_i^2+b_i^2+c_i^2}, \ i \in \Omega_f.\]
This means that the pixels corresponding to model vertices have the highest weight of 1, while the weights decline towards the center of each triangular patch, reaching 1/√3 ≈ 0.58 at the centroid. An illustration of the weights is given in the figure below, where bright pixels represent large weights.
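The weight formula can be checked numerically with a few lines. The function name `prior_weight` is illustrative; the inputs are the barycentric coordinates (ai, bi, ci) of a pixel, which sum to one.

```python
import math


def prior_weight(a, b, c):
    """Weight eta_i = sqrt(a^2 + b^2 + c^2) for barycentric coordinates (a, b, c)."""
    return math.sqrt(a * a + b * b + c * c)


w_vertex = prior_weight(1.0, 0.0, 0.0)          # pixel coincides with a vertex
w_edge = prior_weight(0.5, 0.5, 0.0)            # midpoint of an edge
w_centroid = prior_weight(1 / 3, 1 / 3, 1 / 3)  # center of the triangle
```

The weight is 1 at a vertex, about 0.71 at an edge midpoint, and about 0.58 at the centroid, so the prior is trusted most where the sparse model actually provides a sample.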
3. Energy optimization.
From the definition of the energy function, it can be seen that the optimization of U remains a convex task for a given fixed prior P. However, the optimization of the face model parameter set u might not be convex, since it involves rigid and non-rigid deformation. Therefore, to tackle the global optimization task, which includes recovering both the depth U and the deformation u, we resort to a standard alternating optimization process. In other words, we first optimize u while keeping U fixed, and then optimize U for the fixed deformation u. Specifically, we divide our problem into three well-studied subproblems: depth recovery, rigid registration, and non-rigid deformation. The algorithm is detailed below:
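Independently of the authors' algorithm, the alternating pattern itself can be illustrated on a deliberately simplified toy problem. The sketch below is not the method of this work: it drops the regularizer E_r, assumes a quadratic data term (U(i) - Z(i))^2 in place of E_d, and reduces the deformation u to a single scalar depth offset added to a fixed prior shape `Y0`. All function and variable names are hypothetical.

```python
def energy(U, u, Z, Y0, eta, lam_d, lam_f):
    """Toy energy: assumed quadratic data term plus the weighted face prior term."""
    return sum(lam_d * (Ui - Zi) ** 2 + lam_f * e * (Ui - (Yi + u)) ** 2
               for Ui, Zi, Yi, e in zip(U, Z, Y0, eta))


def alternate(Z, Y0, eta, lam_d=1.0, lam_f=1.0, max_iters=100, tol=1e-9):
    """Alternate exact minimisation over u (U fixed) and over U (u fixed)."""
    U = list(Z)   # initialise the depth with the noisy measurement
    u = 0.0
    history = []
    for _ in range(max_iters):
        # u-step: with U fixed, the best scalar offset is a weighted mean.
        u = sum(e * (Ui - Yi) for e, Ui, Yi in zip(eta, U, Y0)) / sum(eta)
        # U-step: with u fixed, each pixel has a closed-form minimiser.
        U = [(lam_d * Zi + lam_f * e * (Yi + u)) / (lam_d + lam_f * e)
             for Zi, Yi, e in zip(Z, Y0, eta)]
        history.append(energy(U, u, Z, Y0, eta, lam_d, lam_f))
        if len(history) > 1 and history[-2] - history[-1] < tol:
            break
    return U, u, history


# Made-up 1D data: prior shape, noisy depths (mm) and prior weights.
Y0 = [0.0, 1.0, 2.0, 1.0, 0.0]
Z = [1502.0, 1500.0, 1502.5, 1499.0, 1501.0]
eta = [1.0, 0.7, 0.6, 0.7, 1.0]

U, u, history = alternate(Z, Y0, eta)
```

Because each step minimises the energy exactly over one block of variables, the recorded energy never increases. In the real problem the u-step is itself split into rigid registration and non-rigid deformation, and it is not convex, so only convergence to a local minimum can be expected.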
4. Experiments.
a. Synthetic data
We first use the BU4D Facial Expression Database [3] for quantitative evaluation. Considering that Kinect is the most popular commodity RGB-D sensor, we add some Kinect-like artifacts to the depth maps generated from the BU4D database.
We rendered the BU4DFE data at different distances: 1.2 m, 1.5 m, 1.75 m and 2 m. By using synthetic data, we are able to obtain the ground truth for quantitative evaluation. Specifically, we measure the depth recovery performance in terms of average pixel-wise Mean Absolute Error (MAE) in mm. The following plot compares the results of our method to the baseline [1].
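For reference, a pixel-wise MAE of this kind can be computed as in the sketch below. Restricting the average to a validity mask (for example, the face region) is an assumption made here for illustration; the function name and the sample values are made up and are not results from the evaluation.

```python
def mean_absolute_error_mm(recovered, ground_truth, mask):
    """Average |recovered - ground_truth| in mm over pixels where mask is True."""
    errors = [abs(r - g) for r, g, m in zip(recovered, ground_truth, mask) if m]
    if not errors:
        raise ValueError("mask selects no pixels")
    return sum(errors) / len(errors)


# Made-up depths in millimetres; the last pixel is excluded by the mask.
recovered = [1201.0, 1498.0, 1752.0, 0.0]
ground_truth = [1200.0, 1500.0, 1750.0, 2000.0]
mask = [True, True, True, False]

mae = mean_absolute_error_mm(recovered, ground_truth, mask)
```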
Besides the recovery error, we also evaluate the registration accuracy. To obtain the reference registration and shapes, we fit the 3D face model to noise-free data. The face model is also fitted to the depth maps obtained by the different methods. We then compare each fitting result with the reference registration. Table 1 shows that the proposed method produces a more accurate face registration than the baseline method, especially in the eye region and around the face boundary.
Samples at 1.75m
Samples at 2m