Video-based Face Tracking and Recognition with Visual Constraints 
We address the problem of tracking and recognition of faces in real-world, noisy videos. We identify faces using a tracker that adaptively builds the target model, reflecting the changes in appearance typical of a video setting. However, adaptive appearance trackers often suffer from drifting, a gradual adaptation of the tracker to non-targets. To alleviate this problem, our tracker introduces visual constraints using a combination of local generative and discriminative models in a particle filtering framework. The generative term conforms the particles to the space of generic face poses, while the discriminative one ensures rejection of poorly aligned targets. This leads to a tracker that significantly improves robustness against abrupt appearance changes and occlusions, critical for the subsequent recognition phase. The identity of the tracked subject is established by fusing pose-discriminant and person-discriminant features over the duration of a video sequence. This leads to a robust video-based face recognizer with state-of-the-art recognition performance. We test the quality of tracking and face recognition on real-world noisy videos from YouTube as well as the standard Honda/UCSD database. Our approach produces successful face tracking results on over 80% of all videos without video- or person-specific parameter tuning. The good tracking performance induces similarly high recognition rates: 100% on Honda/UCSD and over 70% on the new YouTube set with 35 celebrities.
1. Face Tracking with Visual Constraints
We consider challenging cases that include significant amounts of facial pose change, illumination variation, and occlusion. The tracking problem can be cast as online temporal filtering, where the tracking state is represented by affine transformation parameters.
Within the particle filtering framework, the likelihood potential plays the crucial role of weighting each candidate state against the observed frame.
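As a deliberately simplified sketch of such a filter (not the paper's implementation), the following runs a sampling-importance-resampling loop over a 6-dimensional affine state; the toy likelihood, the Gaussian random-walk dynamics, and all parameter names are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, likelihood, motion_std=0.05):
    """One SIR (sample-importance-resample) step over affine warp states.

    particles : (N, 6) array of affine parameters (e.g., x, y, scale,
                angle, aspect, skew) -- the tracking state of each particle.
    likelihood: maps a (6,) state to a data-compatibility score p(I_t | u_t).
    """
    n = len(particles)
    # Resample proportionally to the previous importance weights.
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx]
    # Propagate with a Gaussian random walk (the dynamics model).
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Reweight each particle by the likelihood potential and renormalize.
    w = np.array([likelihood(u) for u in particles])
    w = w / w.sum()
    return particles, w

# Toy likelihood peaked at a "true" warp at the origin (stands in for the
# image-based likelihood the tracker would evaluate on each candidate crop).
true_state = np.zeros(6)
lik = lambda u: np.exp(-np.sum((u - true_state) ** 2) / 0.1)

particles = rng.normal(0.0, 0.5, (200, 6))
weights = np.full(200, 1.0 / 200)
for _ in range(20):
    particles, weights = particle_filter_step(particles, weights, lik)
estimate = weights @ particles  # weighted-mean state estimate
```

Over the iterations, resampling concentrates the particle set around high-likelihood warps, so the weighted-mean estimate converges toward the true state.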
The well-known first-frame (two-frame) tracker often fails under appearance change (occlusion). Recently, the Incremental Visual Tracker (IVT) was introduced to adapt to appearance change: it incrementally updates its appearance model based on previous tracking estimates. However, it still suffers from abrupt pose changes and occlusions.
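A rough sketch of such an adaptive appearance model follows. The actual IVT performs an incremental SVD with a forgetting factor; here, as a simplified stand-in, we just recompute a top-k eigenbasis from a buffer of recently tracked crops:

```python
import numpy as np

def appearance_model(recent_crops, k=16):
    """Simplified stand-in for IVT's incremental subspace update: recompute
    the mean and top-k eigenbasis from a buffer of recently tracked crops.
    (The real IVT updates the basis incrementally with a forgetting factor
    rather than recomputing it from scratch.)"""
    X = np.asarray(recent_crops, dtype=float)  # (n, d) flattened face crops
    mean = X.mean(axis=0)
    # Columns of U are the principal appearance directions.
    U, _, _ = np.linalg.svd((X - mean).T, full_matrices=False)
    return mean, U[:, :k]
```

The drift problem arises exactly here: if the buffer fills with occluder pixels, the recomputed basis starts to model the occluder instead of the face.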
In addition to the adaptive term of the IVT, our proposed tracker introduces two likelihood terms that serve as visual constraints. The first is the distance to a pose-specific subspace; the intuition is to restrict candidate tracks to conform to predefined facial prototypes. The second is an SVM-based face-crop discriminator, which quickly discards ill-cropped and misaligned candidates.
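The combination of the three terms could be sketched as an energy whose negative is the data log-likelihood. This is an illustration under assumptions of our own (orthonormal subspace bases, a linear SVM, hinge-style penalty, unit weights), not the paper's exact formulation:

```python
import numpy as np

def combined_neg_energy(crop, adaptive_basis, pose_bases, svm_w, svm_b,
                        lambdas=(1.0, 1.0, 1.0)):
    """Illustrative data log-likelihood -E(I_t) combining the three terms:
    adaptive (IVT) subspace, nearest pose-specific subspace, and SVM crop
    discrimination. All names and the linear-SVM form are assumptions.
    """
    def recon_err(x, B):
        # Squared distance from x to the span of the orthonormal basis B.
        proj = B @ (B.T @ x)
        return float(np.sum((x - proj) ** 2))

    e_adapt = recon_err(crop, adaptive_basis)             # adaptive (IVT) term
    e_pose = min(recon_err(crop, B) for B in pose_bases)  # nearest pose subspace
    e_svm = max(0.0, 1.0 - float(svm_w @ crop + svm_b))   # hinge on SVM score
    l1, l2, l3 = lambdas
    return -(l1 * e_adapt + l2 * e_pose + l3 * e_svm)
```

A candidate crop that the adaptive model explains well but that lies far from every generic face pose, or that the SVM rejects, receives a low score, which is what prevents adaptation to occluders.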
The following illustrative example compares the proposed tracker with the IVT. (Top) During the occlusion at t = 104–106, the IVT (green box) adapts to the non-target images, while the proposed tracker (red box) survives thanks to the two constraint terms (pose + SVM). (Bottom) Tracked data compatibility (data log-likelihood) of the two trackers. Lines in red (green) are the values of -E(It) evaluated on the red (green) boxes by the proposed tracker (solid) and the IVT (dashed). During the occlusion, the IVT adapts strongly to the wrong target (e.g., t = 106), leading to a highly peaked data score; consequently, at t = 108, the green particle is incorrectly chosen as the best estimate. The visual constraints restrict adaptation to the occluding non-target, producing more balanced hypotheses that get resolved in the subsequent frames.
2. Video-based Face Recognition
In video-based face recognition, we assume that both the training and test data are sequences of frames, where each frame is a well-cropped, well-aligned face image that may be obtained from the output of face tracking. One could apply a static frame-by-frame face recognizer, but doing so ignores the temporal dependencies between frames. Instead, we cast the task as a sequence classification problem and use a probabilistic sequence model, the hidden Markov model (HMM).
For the observation feature of the HMM, we project the image onto an offline-trained pose-discriminant LDA subspace. The pose space presents an appealing choice for the latent space: unlike arbitrary PCA-based subspaces, it allows the use of well-defined discriminative pose features in the face recognition HMM. This indeed gives better results than PCA-based features (see the table below). Moreover, our recognizer can easily be enriched with additional observation features that may further improve recognition accuracy. We introduce the so-called LMT (LandMark Template) features, which consist of multi-scale Gabor responses (at 6 scales and 12 orientations) computed at 13 landmark facial points, each searched locally within the bounding box starting from the tracked state. Since the LMT features are high-dimensional (~1000), we apply PCA to extract only the 10 major factors. We concatenate the LMT features with the pose-discriminant LDA features to form the observation feature vector of our recognizer. For the Honda/UCSD dataset, the table below shows the recognition accuracies of the proposed model (LDA+LMT and LDA-Only), manifold-based approaches [1,2], and static frame-by-frame approaches.
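A sketch of how the two feature sets could be assembled into a single observation vector. The projection matrices are assumed to have been trained offline; the Gabor extraction at the 13 landmarks is omitted and its raw output is passed in directly, and all variable names are illustrative:

```python
import numpy as np

def observation_feature(face, lda_proj, lmt_feats, lmt_pca_mean, lmt_pca_basis):
    """Form the HMM observation: pose-discriminant LDA features concatenated
    with PCA-reduced LandMark Template (LMT) features.

    face:          flattened, aligned face crop, shape (d,)
    lda_proj:      (d, p) offline-trained pose-discriminant LDA projection
    lmt_feats:     raw Gabor responses at the 13 landmarks (~1000-dim)
    lmt_pca_*:     PCA model keeping the 10 major factors
    """
    lda_part = lda_proj.T @ face                             # pose-LDA features
    lmt_part = lmt_pca_basis.T @ (lmt_feats - lmt_pca_mean)  # 10 PCA factors
    return np.concatenate([lda_part, lmt_part])
```

The resulting vector is what the per-subject HMMs treat as the per-frame observation x_t.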
The following illustrates how pose change in a video affects the subject prediction (as well as the pose prediction). The top row shows an example face sequence; the second row gives the pose prediction, P(st|x1, …, xt); and the bottom two rows depict the subject prediction, P(y|x1, …, xt), in historical and histogram views. The pose is predicted correctly, changing from frontal to R-profile. The true class is Danny; the subject is initially predicted incorrectly as Ming (blue/dashed curve in the third row), but as frames accumulate, the red/solid curve (Danny) overtakes it, and the subject is finally predicted correctly.
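Both posteriors above can be obtained from standard HMM forward filtering: the pose posterior P(st|x1, …, xt) from the per-frame state beliefs, and the subject posterior by comparing one HMM per subject via its sequence log-evidence. The following is a minimal sketch (not the authors' code); the function names and the toy emission models in the usage are illustrative:

```python
import numpy as np

def hmm_forward(log_obs, trans, prior):
    """Forward filtering. Returns the per-frame state posteriors
    P(s_t | x_1..t) and the sequence log-evidence log P(x_1..T).

    log_obs: (T, S) per-frame log-likelihoods log p(x_t | s_t)
    trans:   (S, S) state transition matrix; prior: (S,) initial distribution
    """
    T, S = log_obs.shape
    post = np.zeros((T, S))
    belief = prior * np.exp(log_obs[0])
    log_evidence = np.log(belief.sum())
    post[0] = belief / belief.sum()
    for t in range(1, T):
        # Predict through the transition matrix, then correct by the observation.
        belief = (trans.T @ post[t - 1]) * np.exp(log_obs[t])
        log_evidence += np.log(belief.sum())
        post[t] = belief / belief.sum()
    return post, log_evidence

def subject_posterior(seq, models):
    """P(y | x_1..T): one HMM per subject, compared by forward log-evidence.
    models: list of (log_obs_fn, trans, prior) triples, one per subject."""
    logls = np.array([hmm_forward(fn(seq), A, pi)[1] for fn, A, pi in models])
    w = np.exp(logls - logls.max())  # softmax over subjects (uniform prior)
    return w / w.sum()
```

Because the log-evidence accumulates frame by frame, the subject posterior sharpens as the sequence grows, which is exactly the "curve overtaking" behavior described above.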
- [1] K.-C. Lee, J. Ho, M.-H. Yang, and D. Kriegman, "Video-based face recognition using probabilistic appearance manifolds," IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2003.
- [2] K.-C. Lee, J. Ho, M.-H. Yang, and D. Kriegman, "Visual tracking and recognition using probabilistic appearance manifolds," Computer Vision and Image Understanding, 2005.
- [3] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley, "Face Tracking and Recognition with Visual Constraints in Real-World Videos," IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
See Software & Data for details on how to obtain the dataset used in this work.