**1. Abstract**

Tracking human faces has remained an active research area among the computer vision community for a long time due to its usefulness in a number of applications, such as video surveillance, expression analysis and human-computer interaction. An automatic vision-based tracking system is desirable and such a system should be capable of recovering the head pose and facial features, or facial actions. It is a non-trivial task because of the highly deformable nature of faces and their rich variability in appearances.

A popular approach for face modeling and alignment is using statistical models such as Active Shape Models and Active Appearance Models. These techniques have been refined over long period of time and proven to be really robust. However, they were originally developed to work on 2D texture and require intensive preparation of training data. Using 3D morphable model on the other hand is another approach. In these techniques, a 3D facial shape model is deformed to fit to input data. These trackers rely on either texture or depth, not taking advantages of both sources of information or using them sparsely. In addition, sophisticated trackers use specially designed 3D face models which are not freely available. Lastly, they often require prior training or manual initial alignment of the face model performed by human operators.

In this work, we propose a hybrid on-line 3D face tracker to take advantages of both texture and depth information, which is capable of tracking 3D head pose and facial actions simultaneously. First, we employ a generic deformable model, the Candide-3, into our ICP fitting framework. Second, we introduce a strategy to automatically initialize the tracker using the depth information. Lastly, we propose a hybrid tracking framework that combines ICP and OAM to utilize the strengths of both techniques. The ICP algorithm, which is aided by optical flow to correctly follow large head movement, robustly tracks the head pose across frames using depth information. It provides a good initialization for OAM. In return, the OAM algorithm maintains the texture model of the face, adjusts any drifting incurred by ICP and transforms the 3D shape closer to correct deformation, which then provides ICP with a good initialization in the next frame.

2. Parameterized Face Model

We use an off-the-shell 3D deformable model, Candide-3, which was developed by J. Ahlberg [1]. The deformation of the face model is controlled by Shape Units (SUs) which represent face biometry specific to a person, and Action Units (AUs) which control facial expressions and are user-invariant. Since every vertex can be transformed independently, each vertex of the model is reshaped according to: \[g = p_0 + S\sigma + A\alpha \] where $p_0$ is the base coordinates of a vertex {\it p}, S and A are shape and action deformation matrices associated with vertex {\it p}, respectively. $\sigma$ is the vector of shape deformation parameters and $\alpha$ is the action deformation parameters vector. In general, the transformation of a vertex given global motion including rotation {\it R} and translation {\it t} is defined as: \[p’ = R(p_0 + S\sigma + A\alpha ) + t \]

We use the first frame to estimate the SU parameters corresponding to the test subject in neutral expression, together with initial head pose. From the second frame onwards , we keep shape unit parameters $\sigma$ unchanged and track the action unit parameters $\alpha$, along with head pose {\it R} and {\it t}. 7 action units are tracked in our framework as depicted below.

3. Initialization

The initialization pipeline is described in the following figure:

First, using a general 2D face alignment algorithm, we can reliably detect 6 features points (eye/mouth corners) as shown below

These 2D points are back-projected to world coordnates to form a set of 3D correspondences using the depth map. Then using the registration technique in [2], we recover the initial head pose. We use some heuristics to guess initial shape parameters by searching for facial parts (nose, chin). Lastly, we jointly optimize pose and shape unit parameters by minimizing the the following ICP energy:

\[\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}}

\over R} ,\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}}

\over t} ,\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}}

\over \sigma } = \mathop {\arg \min }\limits_{R,t,\sigma } \sum\limits_{i = 1}^N {{{\left\| {R({p_0} + {S_i}\sigma ) + t – {d_i}} \right\|}^2}} \]

Levenberg-Marquardt algorithm is used to solve the above non-linear least squares problem [3].

**4. Tracking**

The overall tracking process is given in the below diagram:

The tracking process starts with minimizing the ICP energy to recover the head pose and action unit parameters. The procedure is similar to Algorithm 1, with only one change: in the first iteration, the correspondences are formed by optical flow tracking of the 2D-projected vertex features from the previous color frame to the current color frame. From the second iteration, correspondences are found by searching for closest points.

Optical flow inherently introduces drifting into tracking, and the error accumulated over time will certainly reduce the tracking performance. Thus we incorporate On-line Appearance Model as a refinement step in our tracker using the full facial texture information while maintaining the no-training requirement.

The On-line Appearance Model in our tracker is similar to that of [4], in which:

-The appearance model is represented in a fixed-sized template.

-The mean appearance is built on-line for the current user after the 1st frame

– Each pixel in the template is modeled by an independent Gaussian distribution and thus the appearance vector is a multivariate Gaussian distribution which is updated over time:

\[{\mu _{{i_{t + 1}}}} = \left( {1 – \alpha } \right){\mu _{{i_t}}} + \alpha {\chi _{{i_t}}} \]

\[\sigma _{{i_{t + 1}}}^2 = \left( {1 – \alpha } \right)\sigma _{{i_t}}^2 + \alpha {\left( {{\chi _{{i_t}}} – {\mu _{{i_t}}}} \right)^2} \]

The final transformation parameters are found by minimizing the Mahalanobis distance (*u* is the *(R, t, α)* parameters vector)

\[{{\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}}

\over u} }_t} = \mathop {\arg \min }\limits_{{u_t}} {\sum\limits_{i = 1}^n {\left( {\frac{{\chi {{({u_t})}_i} – {\mu _{{i_t}}}}}{{{\sigma _{{i_t}}}}}} \right)} ^2} \]

**5. Experiments**

5.1. Synthetic Data

Our single-threaded C++ implementation can run at up to 16fps on a 2.3Ghz Intel Xeon CPU, unfortunately that’s not fast enough to run on live stream. We generate 446 synthetic RGBD sequences from BU-4DFE dataset [5] where the initial frames contain neutral expression, with white noise applied to the depth maps. The size of the rendered face is about 200×250 pixels.

We compare the results of our tracker to a pure ICP-based tracker whose resulting parameters are clamped within predefined boundaries to prevent drifting. The errors shown in Table 1 do not truly reflect the superior performance of the hybrid tracker over the ICP tracker as seen in the figure.

**5.2. Real RGB-D sequences**

We capture sequences from a Kinect and a Senz3D cameras. In the Kinect sequence, the depth map is aligned to the color image, and our tracker performs really well.

In the sequence captured from the Senz3D camera, due to the disparity between the texture and the depth map resolutions, we map the texture to the depth map instead – the generated texture thus becomes very noisy but the tracker can still works reasonably.

**Publication**

- H.X. Pham and V. Pavlovic, “Hybrid On-line 3D Face and Facial Actions Tracking in RGBD Video Sequences” In: Proc. International Conference on Pattern Recognition (ICPR). (2014)

**References**

- [1] J. Ahlberg, “An updated parameterized face” Image Coding Group,Dept. of Electrical Engineering, Linkoping University, Tech. Rep.
- [2] K. S. Arun, T. S. Huang, and S. D. Blostein, “Least-squares fitting of two 3d point sets,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, no. 5, pp. 698–700, 1987.
- [3] A. W. Fitzgibbon, “Robust registration of 2d and 3d point sets,” Image and Vis. Comput., no. 21(13-14), pp. 1145–1153, 2003.
- [4] F. Dornaika and J. Orozco, “Real-time 3d face and facial feature tracking,” J. Real-time Image Proc., pp. 35–44, 2007.
- [5] L. Yin, X. Chen, Y. Sun, T. Worm and M. Reale, “A High-Resolution 3D Dynamic Expression Database”, in IEEE FG’08, 2008.