Face Recognition – Videos





Al Pacino

Angelina Jolie

Bill Gates

Jodie Foster

Julia Roberts

Steven Spielberg

Tony Blair

Tracking and Recognition – Bill Clinton

Tracking and Recognition – Jodie Foster

Tracking and Recognition – Tony Blair

Face Recognition

Video-based Face Tracking and Recognition with Visual Constraints [1]

We address the problem of tracking and recognition of faces in real-world, noisy videos. We identify faces using a tracker that adaptively builds the target model reflecting changes in appearance, typical of a video setting. However, adaptive appearance trackers often suffer from drifting, a gradual adaptation of the tracker to non-targets. To alleviate this problem, our tracker introduces visual constraints using a combination of local generative and discriminative models in a particle filtering framework. The generative term conforms the particles to the space of generic face poses while the discriminative one ensures rejection of poorly aligned targets. This leads to a tracker that significantly improves robustness against abrupt appearance changes and occlusions, critical for the subsequent recognition phase. Identity of the tracked subject is established by fusing pose-discriminant and person-discriminant features over the duration of a video sequence. This leads to a robust video-based face recognizer with state-of-the-art recognition performance. We test the quality of tracking and face recognition on real-world noisy videos from YouTube as well as a standard Honda/UCSD database. Our approach produces successful face tracking results on over 80% of all videos without video or person-specific parameter tuning. The good tracking performance induces similarly high recognition rates: 100% on Honda/UCSD and over 70% on the new YouTube set with 35 celebrities. 

1. Face Tracking with Visual Constraints

We consider challenging cases that include significant facial pose changes, illumination variation, and occlusion. The tracking problem can be seen as online temporal filtering, where the tracking state is represented by affine transformation parameters.

Under the particle filtering framework, the likelihood potential plays a crucial role.

The well-known first-frame (two-frame) tracker often fails under appearance change (occlusion). Recently, the Incremental Visual Tracker (IVT) was introduced to adapt to appearance change. It incrementally updates the appearance model based on the previous estimates. However, it still suffers from abrupt pose changes and occlusions.

In addition to the adaptive term of the IVT, our proposed tracker introduces two likelihood terms that serve as visual constraints. The first term is the distance to a pose-specific subspace; the intuition is to restrict the candidate track to conform to predefined facial prototypes. The second term is an SVM face-crop discriminator, which can quickly discard ill-cropped and ill-aligned candidates.
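As a concrete illustration, the combined particle score could be sketched as below. This is a minimal sketch, not the paper's implementation: the patch preprocessing, the subspace representations, the weights, and the `decision_function` interface (as in a trained scikit-learn linear SVM) are all assumptions.

```python
import numpy as np

def particle_log_likelihood(patch, adaptive_model, pose_subspaces, svm,
                            weights=(1.0, 1.0, 1.0)):
    """Score one candidate face crop with the three likelihood terms.

    patch           -- vectorized candidate crop (hypothetical preprocessing)
    adaptive_model  -- (mean, basis) of the IVT-style incremental subspace
    pose_subspaces  -- list of (mean, basis) pairs, one per generic face pose
    svm             -- object exposing decision_function(x) -> margin; positive
                       margin means a well-aligned, face-like crop
    """
    w_adapt, w_pose, w_svm = weights

    def recon_error(x, mean, basis):
        # squared distance to a linear subspace: ||x - m||^2 - ||B^T (x - m)||^2
        d = x - mean
        return float(d @ d - np.sum((basis.T @ d) ** 2))

    # 1) adaptive term: distance to the incrementally updated appearance subspace
    e_adapt = recon_error(patch, *adaptive_model)
    # 2) pose constraint: distance to the closest generic pose subspace
    e_pose = min(recon_error(patch, m, b) for m, b in pose_subspaces)
    # 3) SVM constraint: penalize crops the face/non-face SVM rejects
    e_svm = max(0.0, -float(svm.decision_function(patch[None, :])[0]))

    return -(w_adapt * e_adapt + w_pose * e_pose + w_svm * e_svm)
```

A particle filter would evaluate this score for every sampled affine state and resample proportionally to the (exponentiated) values.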

The following is an illustrative example that compares the proposed tracker with the IVT. (Top) During the occlusion at t = 104 ~ 106, the IVT (green box) adapts to the non-target images, while the proposed tracker (red box) survives thanks to the two constraint terms (pose + SVM). (Bottom) Tracked data compatibility (data log-likelihood) of the two trackers. Lines in red (green) are the values of -E(It) evaluated on the red (green) boxes by the proposed tracker (solid) and the IVT (dashed). During the occlusion, the IVT strongly adapts to the wrong target (e.g., t = 106), leading to a highly peaked data score. Consequently, at t = 108, the green particle is incorrectly chosen as the best estimate. The visual constraints restrict the adaptation to the occluding non-target, producing more balanced hypotheses that get resolved in the subsequent frames.

2. Video-based Face Recognition

In video-based face recognition, we assume that both training and test data are sequences of frames, where each frame is a well-cropped and well-aligned face image, which may be obtained from the output of face tracking. One could apply static frame-by-frame face recognizers; however, this has certain drawbacks. Instead, treating the problem as one of sequence classification, we use a probabilistic sequence model such as an HMM.
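In code, the sequence-level decision reduces to scoring the whole track under each subject's model. A minimal sketch, assuming per-subject models that expose a `score(obs)` log-likelihood method (the interface of, e.g., hmmlearn's `GaussianHMM`):

```python
def classify_sequence(obs, subject_hmms):
    """Pick the identity whose sequence model best explains the tracked frames.

    obs          -- (T, d) array of per-frame observation features
    subject_hmms -- dict: subject name -> fitted model with score(obs) -> log-lik
    """
    return max(subject_hmms, key=lambda name: subject_hmms[name].score(obs))
```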

For the observation feature of the HMM, we project the image onto an offline-trained pose-discriminant LDA subspace. The pose space presents an appealing choice for the latent space: unlike arbitrary PCA-based subspaces, it allows the use of well-defined discriminative pose features in the face recognition HMM. This indeed gives better results than PCA-based features (see the table below). Moreover, our recognizer can easily be enriched with additional observation features that may further improve recognition accuracy. We introduce the so-called LMT (LandMark Template) features. The LMT features consist of multi-scale Gabor features (at 6 scales and 12 orientations) applied to 13 landmark facial points that are locally searched within the bounding box starting from the tracked state. Since the LMT features are high-dimensional (~1000), we use PCA to extract only the 10 major factors. We concatenate the LMT features with the pose-discriminant LDA features to form the observation feature vector of our recognizer. For the Honda/UCSD dataset [2], the table below shows the recognition accuracies of the proposed model (LDA+LMT and LDA-Only), manifold-based approaches [1,2], and static frame-by-frame approaches.
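The feature construction above might be sketched as follows. The raw LMT extraction (13 landmarks x 6 scales x 12 orientations) is assumed to have been done elsewhere, and for brevity the PCA is fit on the given frames rather than offline as in the actual system.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_observation_features(frames_lmt, frames_lda_pose, n_lmt_factors=10):
    """Form the per-frame HMM observation vectors described above.

    frames_lmt      -- (T, ~1000) array of raw LandMark Template (Gabor) features
    frames_lda_pose -- (T, q) pose-discriminant LDA projections of the frames
    """
    # compress the high-dimensional LMT features to a few major factors
    pca = PCA(n_components=n_lmt_factors)
    lmt_low = pca.fit_transform(frames_lmt)
    # concatenate with the pose-discriminant LDA features
    return np.hstack([frames_lda_pose, lmt_low])
```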

The following illustrates how pose change in the video can affect the subject prediction (as well as the pose prediction). The top row shows an example face sequence, the second row gives the pose prediction, P(st|x1, …, xt), and the bottom two rows depict the subject prediction, P(y|x1, …, xt), in historical and histogram views. The pose is correctly predicted as changing from frontal to R-profile. The true class is Danny. The prediction is initially incorrect, favoring Ming (blue/dashed curve in the third row), but as time goes by the red/solid curve overtakes Ming's and the subject is finally predicted correctly.





  • K.-C. Lee, J. Ho, M.H. Yang, and D. Kriegman, Video-based face recognition using probabilistic appearance manifolds, Computer Vision and Pattern Recognition (CVPR), 2003
  • K.-C. Lee, J. Ho, M.H. Yang, and D. Kriegman, Visual tracking and recognition using probabilistic appearance manifolds, Computer Vision and Image Understanding, 2005



  • [1] M. Kim, S. Kumar, V. Pavlovic and H. Rowley. “Face Tracking and Recognition with Visual Constraints in Real-World Videos”. IEEE Conf. Computer Vision and Pattern Recognition. 2008.

See Software & Data for details on how to obtain the dataset used in this work.


Image Segmentation

Image segmentation is one of the most important steps leading to the analysis of image data. The goal is to divide the image into parts that have homogeneous attributes and a strong correlation with objects or areas of the real world contained in the image. Region-based segmentation methods, e.g., Markov Random Fields (MRFs), are usually robust to noise and can easily capture contextual dependencies, but they often generate rough boundaries and make it hard to incorporate shape and topology constraints. On the other hand, edge-based segmentation methods, e.g., deformable models, can easily incorporate shape priors and object topology, but they are sensitive to noise (false edges and weak edges) and, sometimes, to initialization.

In this project, we combine deformable models and Markov random fields in a graphical model framework for better image segmentation. The integrated framework takes advantage of both models and generates better segmentation results in many cases.

  • The tightly coupled model
  • The exact (yet intractable) inference
  • The variational inference (to decouple the original intractable model inference)
  • The extended MRF model (solved by the belief propagation algorithm)
  • The probabilistic deformable model
  • The optimal variational parameters

The whole segmentation algorithm is an EM algorithm that solves the above equations iteratively.


More results can be found in our paper [1].

  • A “more” tightly-coupled model (belief propagation inference can be performed in the whole model instead of using the variational inference) [2]
  • An extension to 3D segmentation [3]


  • [1] R. Huang, V. Pavlovic and D. N. Metaxas. “A graphical model framework for coupling MRFs and deformable models”. Proc. CVPR. 2004.
  • [2] R. Huang, V. Pavlovic and D. N. Metaxas. “A Hybrid Framework for Image Segmentation Using Probabilistic Integration of Heterogeneous Constraints”. Computer Vision for Biomedical Image Application: Current Techniques and Future Trends. 2005.
  • [3] R. Huang, V. Pavlovic and D. N. Metaxas. “A tightly coupled region-shape framework for 3D medical image segmentation”. Int’l Symposium Biomedical Imaging. 2006.



Shape Modeling

Shape analysis is an important process for many computer vision applications, including image classification, recognition, retrieval, registration, segmentation, etc. An ideal shape model should be both invariant to global transformations and robust to local distortions. In this work we developed a new shape modeling framework that achieves both efficiently. A shape instance is described by a curvature-based shape descriptor. A Profile Hidden Markov Model (PHMM) is then built on such descriptors to represent a class of similar shapes. PHMMs are a particular type of Hidden Markov Models (HMMs) with special states and architecture that can tolerate considerable shape contour perturbations, including rigid and non-rigid deformations, occlusions, and missing parts. The sparseness of the PHMM structure provides efficient inference and learning algorithms for shape modeling and analysis. To capture the global characteristics of a class of shapes, the PHMM parameters are further embedded into a subspace that models long term spatial dependencies. The new framework can be applied to a wide range of problems, such as shape matching/registration, classification/recognition, etc. Our experimental results demonstrate the effectiveness and robustness of this new model in these different settings. [1]


[1] R. Huang, V. Pavlovic and D. N. Metaxas. “Embedded Profile Hidden Markov Models for Shape Analysis”. IEEE Int’l Conf. Computer Vision. 2007.

Nonlinear Dimensionality Reduction for Regression

Nonlinear Dimensionality Reduction for Regression [1]

The task of dimensionality reduction for regression (DRR) is to find a low dimensional representation, z (q-dim), of the input covariates, x (p-dim), with q << p, for regressing the output, y (d-dim), given n i.i.d. data {(xi, yi)}. DRR is mainly useful for: (1) visualization of high dimensional data, (2) efficient regressor design with a reduced input dimension, and (3) elimination of noise in data x by uncovering the essential information z for predicting y. It should be noted that DRR is not tied to a particular regression estimation method, and can be rather seen as a prior task to the regressor design for a better understanding of data.

DRR differs from other well-known dimensionality reduction algorithms. To clarify, one can categorize DRR as a supervised technique with a real multivariate label y. In contrast, most supervised techniques are devoted to the classification setting (i.e., discrete y), including Linear Discriminant Analysis (LDA), kernel LDA, general graph embedding, and metric learning. The unsupervised dimension reduction framework assumes that y is unknown altogether, subsuming the principal subspace methods (PCA and kernel PCA), the nonlinear locality-preserving manifold learning methods (LLE, ISOMAP, and Laplacian Eigenmap), and probabilistic methods like GPLVM.

The crucial notion related to DRR is sufficient dimension reduction (SDR), which states that one has to find linear subspace bases B = [b1, …, bq], where bi is a p-dim vector (in the nonlinear case, B = {b1(), …, bq()}, where bi() is a nonlinear basis function), such that y and x are conditionally independent given BTx. As this condition implies that the conditional distribution of y given x equals that of y given z = BTx, the dimension reduction entails no loss of information for the purpose of regression. It is known that such a B always exists, though it is not unique. Hence we are naturally interested in the minimal subspace, or the intersection of all such subspaces, often called the central subspace (although the term subspace usually refers to the linear case, we abuse it for both linear and nonlinear cases).
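In symbols, the SDR condition and its consequence can be written as:

```latex
y \;\perp\!\!\!\perp\; x \mid B^{\top}x
\quad\Longleftrightarrow\quad
p(y \mid x) \;=\; p\!\left(y \mid z\right), \qquad
z = B^{\top}x, \quad B = [b_1, \ldots, b_q],\; b_i \in \mathbb{R}^{p}.
```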

Typically, two schools of approaches have been suggested to find the central subspace: the inverse regression (IR) [1,3] and the kernel dimension reduction (KDR) [2,4]. KDR in [2] directly reduces the task of imposing conditional independence to the optimization problem that minimizes the conditional covariance operator in RKHS (reproducing kernel Hilbert space). This is achieved by quantifying the notion of conditional dependency (between y and x given BTx) by the positive definite ordering of the expected covariance operators in what is called the probability-determining RKHS (e.g., the RBF kernel-induced Hilbert space).

Although KDR formulates the problem in an RKHS, the final projection is linear in the original space. For a nonlinear extension, [4] proposed manifold KDR, which first maps the original input space to a nonlinear manifold (e.g., by a Laplacian Eigenmap learned from x only) and then applies KDR to find a linear subspace in the manifold. However, this introduces a tight coupling between the central space and the separately learned manifold, which restricts the method to a transduction setting: for a new input point, one has to rebuild the manifold entirely with data including the new point. Moreover, neither method has a closed-form solution, so both resort to gradient-based optimization.

Inverse regression (IR) is another interesting framework for DRR. IR is based on the fact that the inverse regression, E[x|y], lies in the subspace spanned by B (the bases of the central subspace), provided that the marginal distribution of x is elliptically symmetric (e.g., a Gaussian). Thus B coincides with the principal directions of the variance of the inverse regression, namely V(E[x|y]). In [1], this variance was estimated by slicing the output space (i.e., clustering on y), hence the name sliced IR (SIR).
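The classic SIR estimator [1] is simple enough to sketch directly. This is an illustrative implementation under the usual assumptions (scalar y sliced by sorting, x whitened by its sample covariance), not code from the paper.

```python
import numpy as np

def sliced_inverse_regression(X, y, n_slices=10, n_components=2):
    """Estimate central-subspace directions via SIR (Li, 1991).

    X -- (n, p) covariates; y -- (n,) scalar response (slicing would be
    replaced by clustering for multivariate y). Returns a (p, n_components)
    matrix of estimated basis directions.
    """
    n, p = X.shape
    # whiten x with the symmetric inverse square root of its covariance
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-8 * np.eye(p)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = (X - mu) @ W
    # slice the sorted output and form the slice means of z: estimates of E[z|y]
    slices = np.array_split(np.argsort(y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)   # weighted estimate of V(E[z|y])
    # top eigenvectors of M, mapped back to the original x coordinates
    _, evecs = np.linalg.eigh(M)
    return W @ evecs[:, ::-1][:, :n_components]
```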

Despite its simplicity and closed-form solution, SIR assumes a linear central subspace and places a strong restriction on the marginal distribution of x. To cope with these limitations, a natural kernel extension (KSIR) was proposed in [3]. It discovers a nonlinear central subspace and, moreover, places few restrictions on the distribution of x. However, KSIR still resorts to slicing on y, which can result in unreliable variance estimates for high-dimensional y.

In this work we propose a novel nonlinear method for DRR that exploits the kernel Gram matrices of the input and output. It estimates the variance of the inverse regression under the IR framework, but avoids slicing through the effective use of covariance operators in an RKHS. In fact, we show that KSIR is a special case of our method, in that KSIR can be instantiated by a particular choice of output kernel matrix. Our approach can be reliably applied to cases of high-dimensional output, while suppressing potential noise in the output data.


We demonstrate the benefits of the proposed method in a comprehensive set of evaluations on several important regression problems that often arise in computer vision:

a) Estimation of head pose from images with varying illumination conditions

  • input x: (64 x 64) face images with diverse lighting directions
  • output y: 2D head pose (left/right and up/down rotation angles)
  • central subspace dim: 2
Central subspace obtained from the proposed “COIR”

Central subspace from KSIR

b) Human body pose estimation from a silhouette image

  • input x: (160 x 100) silhouette image at a side view
  • output y: 59-dim 3D joint angles at articulation points
  • central subspace dim: 2

Selected frames from a half walking cycle

Central subspaces; The proposed method denoted by “COIR”


[1] K.-C. Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Association, 1991 
[2] K. Fukumizu, F. R. Bach, and M. I. Jordan, Dimensionality reduction for supervised Learning with reproducing kernel Hilbert spaces, Journal of Machine Learning Research, 2004 
[3] H. M. Wu, Kernel sliced inverse regression with applications on classification, ICSA Applied Statistics Symposium, 2006 
[4] J. Nilsson, F. Sha, and M. I. Jordan, Regression on manifolds using kernel dimension reduction, International Conference on Machine Learning (ICML), 2007


  • [1] M. Kim and V. Pavlovic. “Dimensionality Reduction using Covariance Operator Inverse Regression”. IEEE Conf. Computer Vision and Pattern Recognition. 2008.


Gaussian Process Manifold Kernel Dimensionality Reduction (GPMKDR) [1]

Constructing optimal regressors is an important task in computer vision. Problems such as tracking, 3D human and general object pose estimation, image and signal denoising, illumination direction estimation are but some of the problems that can be formulated in the regression setting. 
In this work we consider the task of discovering low-dimensional manifolds that preserve information relevant for a general nonlinear regression. We have proposed a novel dimension reduction approach called Gaussian Process Manifold Kernel Dimensionality Reduction (GPMKDR), induced by reformulating manifold kernel dimensionality reduction (mKDR) in the Gaussian Process (GP) framework. In this framework, a closed-form solution for mKDR is given by the maximum eigenvalue-eigenvector solution of a kernelized problem.

Sufficient Dimensionality Reduction (SDR)

Gaussian Process Manifold KDR

Results:

Experiments on two digit datasets

Result of Illumination Estimation

Result of Human Motion Estimation


  • [1] K. Moon and V. Pavlovic. “Visual inference using Gaussian process manifold kernel dimensionality reduction”. 2008.

Boosted Bayesian Network Classifiers

Discriminative Graphical Models

In Collaboration with Dr. James M. Rehg and Yushi Jing, College of Computing, Georgia Institute of Technology.

Discriminative learning, or learning for classification, is a common learning task that has been addressed in a number of different frameworks. One may design a complex classifier, such as a support vector machine, that explicitly minimizes classification error. Alternatively, a set of weak learners can be trained using the boosting algorithm [Schapire97]. However, one may be explicitly interested in constructing a generative model for classification, such as a Bayesian network. The option in that case is to discriminatively train this generative model. Unfortunately, discriminative training of generative models is computationally complex [Friedman97,Greiner02,Grossman04]. On the other hand, if the model is trained in a generative ML fashion, its strong reliance on correct structure and independence assumptions often undermines its classification ability.

In this project we study a new framework for discriminative training of generative models. Similar to traditional boosting, we recursively learn a set of classifiers, this time constructed from generative models. Unlike boosting, where weak classifiers are trained discriminatively, the ‘weak classifiers’ in our method are trained generatively, to maximize the likelihood of the weighted data. This approach has two distinct benefits. First, our classifiers are constructed from generative models. This is important in many practical cases where generative models, such as Bayesian networks or HMMs, are desired or appropriate (e.g., sequence modeling). Second, ML training of generative models is computationally more efficient than discriminative training of the same models. Therefore, by discriminatively setting the weights on the data and by generatively training intermediate models, we achieve a computationally efficient way of training generative classifiers.
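The loop can be sketched as follows. This is an illustrative substitution, not the project's code: the generative weak learner here is Gaussian naive Bayes (which supports weighted maximum-likelihood fitting via `sample_weight` in scikit-learn), standing in for the Bayesian networks used in the papers.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def boost_generative(X, y, n_rounds=10):
    """AdaBoost-style loop whose 'weak classifiers' are generative models
    trained by weighted maximum likelihood. y must be in {-1, +1}.
    Returns (models, alphas).
    """
    n = len(y)
    w = np.full(n, 1.0 / n)
    models, alphas = [], []
    for _ in range(n_rounds):
        # generative step: weighted ML fit of the generative model
        nb = GaussianNB().fit(X, y, sample_weight=w)
        pred = nb.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # discriminative step: reweight the data toward current mistakes
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        models.append(nb)
        alphas.append(alpha)
    return models, alphas

def boost_predict(models, alphas, X):
    # weighted vote of the generatively trained components
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)
```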




  1. Jing, Y., Pavlovic, V. & Rehg, J.M. (2008), “Boosted Bayesian network classifiers”, Machine Learning Journal.
  2. Jing, Y., Pavlovic, V. & Rehg, J.M. (2005) *Tech-report version (GIT-GVU-05-23)*  (Includes a preliminary analysis of boosted Dynamic Bayesian Network Classifiers, as an alternative to discriminative training methods like Conditional Random Fields )
  3. Jing, Y., Pavlovic, V. & Rehg, J.M. (2005) Efficient discriminative learning of Bayesian network classifiers via Boosted Augmented Naive Bayes – Proceedings of International Conference on Machine Learning (ICML 2005), Distinguished Student Paper Award.
  4. Jing, Y., Pavlovic, V. & Rehg, J.M. (2005) Discriminative Learning Using Boosted Generative Models – The Learning Workshop at Snowbird, Utah, April 5-8.

Software and Data

We have developed a C++ library for structure and parameter learning in boosted augmented Bayesian networks.  Please see our Software page for more details.

Classifying Brain Signals


The brain mechanisms underlying the ability of humans to process faces have been studied extensively in the last two decades. Brain imaging techniques, particularly fMRI (functional Magnetic Resonance Imaging), which possesses high spatial resolution but limited temporal resolution, are advancing our knowledge of the spatial and anatomical organization of face-sensitive brain areas. At the other end are EEG recording techniques, with high temporal resolution but poor spatial resolution. They reveal event-related potentials (ERPs) that serve as correlates for various aspects of facial processing. The best-established ERP marker of face perception is N170, a negative component that occurs roughly 170 ms after stimulus onset. Other markers, such as N250, N400 and P600, which occur later than N170, are considered to contribute to face recognition and identification. In a typical fMRI or ERP study, signals are recorded while the subject is exposed to a face or a non-face stimulus, in order to isolate the brain areas that respond differentially to faces (fMRI) and the temporal intervals that exhibit major differences between face and non-face responses (ERP). Early studies in both fMRI and ERP employed simplistic signal processing techniques, involving averaging over multiple instances followed by manual examination to detect differentiating components. More advanced statistical signal analysis techniques were first applied to fMRI signals; systematic analyses generally lagged behind in the ERP domain. A recent principled approach is that of Moulson et al., in which the authors applied statistical classification in the form of Linear Discriminant Analysis (LDA) to face/non-face ERPs and obtained classification accuracies of about 70%.

There are two major goals for the current ERP study: First, to emphasize higher brain areas of visual processing at the expense of early visual areas by using a strategically designed control non-face stimulus. Second, and more importantly, this study seeks to apply systematic machine learning and pattern recognition techniques for classifying face and non-face responses in both the spatial and temporal domains. Toward the first major goal, we use a face and a non-face stimulus derived from the well-known vase/faces illusion.

The key feature of the face and non-face stimuli is that they share the same defining contours, differing only slightly in stimulus space. The defining contours  are attributed to whatever forms the figure; as the percept alternates spontaneously, these contours are attributed alternately to the faces or to the vase.

This attribution is biased, using minimal image manipulations, toward either the vase or the faces, but the same contours are used in both cases. These shared contours result in relatively similar responses for the faces and vase stimuli and produce a difficult classification task for the ensuing ERP signals. With this choice of face and non-face stimuli, we bias the analysis towards detecting signal differences that are elicited by the high-level percepts (face or non-face) rather than by low-level image differences.

With respect to the second major goal, we note that several types of classifiers have been used in previous EEG classification studies: kNN, logistic regression, multi-layer perceptrons, support vector machines, and LDA. Prior work has also used a group penalty for lasso regression in the frequency domain for EEG signal classification and has shown the utility of grouping in that domain. The present study extends this systematic use of classifiers by using the ERP signals obtained with the faces/vase stimuli to test three major classification schemes: kNN, L1-norm logistic regression, and group lasso logistic regression. We perform the classification analysis between the two classes of face and vase responses agnostically, based on purely statistical estimates, without favoring any sensors or temporal intervals. Our goal is to use the data to point to the salient spatio-temporal ERP signatures most indicative of the stimulus classes. ERP classification has important applications in neuroscience, such as brain-computer interfaces (BCI) and the detection of pathologies such as epilepsy.
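The L1-penalized scheme can be sketched with standard tools. This is a minimal illustration: the flattened feature layout, the regularization strength `C`, and the synthetic data in the example are assumptions, not the study's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_sparse_erp_classifier(X, y, C=0.1):
    """L1-penalized logistic regression on flattened ERP trials.

    X -- (n_trials, n_channels * n_timepoints) feature matrix; y -- 0/1 labels
    (face vs. non-face). The L1 penalty drives most coefficients to exactly
    zero, so the surviving weights point to salient channels and time points.
    """
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    sparsity = float(np.mean(clf.coef_ == 0))
    return clf, sparsity
```

Reshaping `clf.coef_` back to (n_channels, n_timepoints) recovers the kind of spatial/temporal coefficient maps shown in the figures below.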

The main results of our tests with the three major classifier schemes are: First, kNN produced the worst performance, having to rely on all, potentially noisy, features. Second, the other two schemes were able to classify the signals with an overall accuracy of roughly 80%. Finally, the learned weights of L1-norm logistic regression and group lasso were able to locate the salient features in space (electrode position) and time in close agreement with the accepted wisdom of previous studies, confirming various markers of face perception such as N170, N400 and P600.


Applying sparse dictionary methods not only improved the classification performance but also pointed out the spatial and temporal regions of high interest. That is, the most discriminant features in time and space (channels) correspond to the ERPs and active regions of the brain that are responsible for distinguishing between a face and a non-face. The following figures illustrate them.

Mapping of L1 regression coefficients to channels (first figure) and time (second figure)


Related Publications

  • [1] S. Shariat, V. Pavlovic, T. Papathomas, A. Braun and P. Sinha. “Sparse dictionary methods for EEG signal classification in face perception”. Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on. 2010. pp. 331 – 336.


Protein Classification

Protein Classification and Functional Prediction 

Proteins are linear sequences of amino acids that perform essential functions in organisms. Determining the function of a protein requires expensive experimental procedures. With better sequencing techniques, it has become evident that computational prediction/classification of an unknown protein’s function will help drive labor and cost down.

Protein remote homology detection involves detecting functionally related proteins with very little sequence identity. The task is challenging for two major reasons: first, the lengths of the sequences vary, and second, remotely homologous proteins need not share great sequence identity; some share as little as 30%.


Previous studies [1,2,3] show that the following features are sufficient to describe superfamilies in the Immunoglobulin-like beta-sandwich fold: 
• a set of critical positions, and 
• the preferred amino acid residues in these positions, and 
• the distance between each neighboring pair of critical positions. 
Our goal is to find computational methods, with good classification performance, that are able to recover these critical positions as well as the preferred residues. 

Our approaches
We attempt to recover these critical positions by employing different approaches. In our earlier work, we used a generative model; at present, we use a discriminative model. 

The generative approach
We employ a class of models known as duration-explicit hidden Markov models in an attempt to recover the critical positions. We call our models sparse profile hidden Markov models and show their structure in Fig. 1. The structure of our model resembles that of a profile hidden Markov model. The crucial difference, however, is that while the number of symbols an insertion state (diamond shape) emits follows a geometric distribution in a profile HMM, in our model it can be explicitly modeled by any arbitrary distribution. Moreover, since we only care about the critical positions, we use the background distribution for the emission probabilities of the insertion states, expressing our indifference to the residue composition between critical positions. Our match states attempt to capture the critical positions.

Our approach of focusing on the critical positions and modeling the emission probabilities of the insertion states as background probabilities results in extremely sparse models. Our publication [5] indicates that approximately m/4 critical positions are sufficient to describe a superfamily with m positions. Figs. 2 and 3 show the performance, in terms of recall (red curves) and precision (blue curves), of the sparse profile HMM on two different superfamilies, Beta-Galactosidase and Cadherin, as a function of the number of critical positions (horizontal axis). The performance of the sparse profile HMM starts to degrade severely once the number of critical positions falls below 25%.

Our main contribution to this project is that our results are consistent with previously established studies [1,2,3]: a set of critical positions, the preferred symbols at these positions, and the distances between neighboring positions are sufficient to describe the superfamilies in the Immunoglobulin-like beta-sandwich fold. Moreover, our models are extremely sparse, and the main advantage of a sparse model is that fewer parameters need to be estimated, so the model is less prone to overfitting. Our results indicate that a 1:4 compression ratio can be achieved.

The discriminative approach
Though in most problem domains a generative model performs quite well, it does come with some disadvantages. Denote by X the sequence (example) and by Y the label of the sequence. First, one needs to assume a parametric form for P(X), and when the parametric assumption is invalid, one risks injecting bias into the model. Second, building a generative model for a superfamily requires only positive sequences, i.e., sequences that are known to belong to the superfamily; no negative sequences participate in the estimation. As a result, if one picks the important features indicated by the learned generative model, one cannot guarantee that these features do not appear in the negative examples.

Discriminative models do not suffer from the shortcomings listed above. First, no parametric form of P(X) is required. Second, a discriminative model focuses on building the decision boundary between the positive and negative classes (meaning both positive and negative examples participate in the estimation process), and thus it guarantees that the important features appear in one class but not the other.

However, the use of discriminative models for protein remote homology detection presents a big challenge: proteins are of variable length, but discriminative models require fixed-length features. To tackle the problem, Jaakkola et al. [6,7] proposed a hybrid model: a generative model, trained using positive sequences only, acts as a feature extractor for both positive and negative sequences, and the extracted features are fed to a discriminative model to perform classification. The features are the gradient of the log-likelihood of the sequence with respect to the model parameters. Alternatively, Leslie et al. [8,9] proposed to represent each sequence by the frequencies of all k-long substrings in a |\Sigma|^k-dimensional spectrum, where \Sigma is the alphabet set, and to use these high-dimensional features for classification.
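The spectrum representation of Leslie et al. is easy to sketch. An illustrative version (k=2 keeps the toy dimension at 400; the papers use longer k-mers):

```python
from itertools import product
import numpy as np

def spectrum_features(seq, alphabet="ACDEFGHIKLMNPQRSTVWY", k=2):
    """Map a variable-length protein sequence to the fixed |Sigma|^k-dimensional
    k-spectrum vector: entry i counts occurrences of the i-th k-mer.
    """
    index = {"".join(kmer): i for i, kmer in enumerate(product(alphabet, repeat=k))}
    v = np.zeros(len(index))
    for j in range(len(seq) - k + 1):
        kmer = seq[j:j + k]
        if kmer in index:           # skip k-mers containing non-standard residues
            v[index[kmer]] += 1
    return v
```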

We follow the same formalism proposed by Jaakkola et al. In our hybrid setting, a profile HMM (a generative model) serves as the feature extractor, and the discriminative model we use is logistic regression. The generative models are estimated from positive sequences only, and the features we propose are the sufficient statistics of the symbols at each match state; these features attempt to capture the critical positions and their preferred symbols. Finally, we place a Normal or Laplacian prior, centered at 0, on the parameters of the logistic regression model; the Laplacian prior in particular enforces sparsity.

We compare our model and features with several state-of-the-art algorithms in Figs. 4 and 5, using ROC-50 scores to evaluate the methods. The ROC-50 score is the normalized area under the ROC curve up to the first 50 false positives. This criterion has been reported to be more effective when comparing methods on highly imbalanced test sets, in which examples of one class (in our case, the negative examples) far outnumber those of the other. In Fig. 4, the horizontal axis corresponds to the ROC-50 score and the vertical axis denotes the number of experiments, out of 54, achieving that ROC-50 score or better. The subfigure in the right panel is a more detailed view of the high-ROC-50 region. The figures indicate that both logistic regression models (with Normal and Laplacian priors) achieve performance comparable to the mismatch(5,1) kernel. However, we note that, across the 54 families, there are 480 features per family on average, while the logistic regression model with Laplacian prior selects only 43 features per family on average, a reduction of more than 90%.
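The ROC-50 score is computed directly from the ranked list of test examples: walk down the ranking, and for each of the first 50 false positives add a column whose height is the number of true positives seen so far, then normalize. A short sketch, assuming at least one positive and at least one negative example (function name ours):

```python
def roc50_score(scores, labels, n_false=50):
    """Normalized area under the ROC curve up to the first `n_false`
    false positives. `scores` are classifier outputs (higher = more
    likely positive); `labels` are 1 for positives, 0 for negatives.
    Assumes at least one positive and one negative example."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = area = 0
    for i in order:
        if labels[i]:
            tp += 1           # curve moves up
        else:
            fp += 1           # each false positive adds a column of height tp
            area += tp
            if fp == n_false:
                break
    # normalize by the area of a perfect ranking over the same range
    return area / (n_pos * fp)
```

A perfect ranking (all positives before any negative) scores 1, matching the best experiments reported below.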

In Fig. 5, we show pairwise performance comparisons of the different methods. In the left panel, we compare the mismatch(5,1) kernel (vertical axis) with the logistic regression model with Normal prior (horizontal axis). Points falling above the diagonal correspond to experiments in which the logistic model with Normal prior performs better than the mismatch(5,1) kernel. There are 30 and 22 points above and below the diagonal, respectively. The two-sided p-value of the sign test is 0.33, which indicates that the performance of the two models is comparable. In the right panel, we compare the effect of different priors on the logistic regression model: Normal (horizontal axis) versus Laplacian (vertical axis). There are 25 and 26 points above and below the diagonal, respectively. The two-sided p-value of the sign test is approximately 1, suggesting that although the Laplacian prior selects only 10% of the features, this sparse set of features plays an important role in the classification task.
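The sign test used above treats each non-tied family comparison as a fair coin flip under the null hypothesis that the two methods are equivalent, so the two-sided p-value is an exact binomial tail probability. A short helper of our own (not from the cited work) reproduces the numbers quoted:

```python
from math import comb

def sign_test_pvalue(wins, losses):
    """Two-sided exact sign test: under H0 each non-tied comparison is
    Bernoulli(0.5); the p-value doubles the smaller binomial tail."""
    n = wins + losses
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For 30 wins versus 22 losses this gives roughly 0.33, and for 25 versus 26 it gives 1, matching the values reported.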

We show the recovered critical positions and the preferred symbols in Figs. 6 and 7. The model (table) shown in Fig. 6 corresponds to the scorpion toxin-like superfamily; the tested family is Plant defensins. Out of 188 features, the logistic regression model with Laplacian prior picks 19 features around 12 positions. The ROC-50 score for this experiment is 1. We also show the corresponding HMM logo obtained from PFAM. The height of each symbol at each position corresponds to its importance as suggested by the profile HMM, estimated from positive sequences only. The leading numbers in the table denote positions in our HMM, while the numbers in parentheses denote positions in the HMM logo; the slight discrepancy occurs because we use a slightly different heuristic to determine the structure of the profile HMM. The HMM logo from PFAM suggests that 8 positions with Cysteine (C) residues play an important role in this superfamily, yet only 5 such positions (circled in red) are captured by our model. Upon investigation, it becomes clear that the other three Cysteine positions are not selected because the same alignment has also been observed in the negative sequences. The model shown in Fig. 7 corresponds to the same superfamily, with a different tested family. The logistic regression model with Laplacian prior selects 14 out of 116 features, and the ROC-50 score of this experiment is 0.91. In this experiment, all 6 Cysteine positions indicated by the HMM logo are captured by our model, suggesting that this pattern is characteristic of, and unique to, this family.

Our main contribution in this part of the project is a set of features that are biologically intuitive and that, combined with our choice of model, yield interpretable results. The performance of our features and model is comparable to that of state-of-the-art algorithms such as SVM-Fisher and string kernels, while offering a biologically intuitive interpretation. With a sparsity-enforcing prior such as the Laplacian, our models produce results consistent with [1,2,3]: a sparse set of positions and their preferred residues may be sufficient to describe a superfamily and, more importantly, our model shows promising ability to recover these positions and preferred residues.


  1. A. E. Kister, A. V. Finkelstein and I. M. Gelfand. Common features in structures and sequences of sandwich-like proteins. PNAS, vol. 99, no. 22, pp. 14137-14141, 2002.
  2. B. Reva, A. Kister, S. Topiol, and I. Gelfand. Determining the roles of different chain fragments in recognition of immunoglobulin fold. Protein Eng., vol. 15, no. 1, pp. 13-19, 2002.
  3. A. E. Kister, M. A. Roytberg, C. Chothia, J. M. Vasiliev, and I. M. Gelfand. The sequence determinants of cadherin molecules. Protein Sci., vol. 10, no. 9, pp. 1801-1810, 2001.
  4. S. Eddy. Profile hidden Markov models. Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998.
  5. P. Huang and V. Pavlovic. Protein Homology Detection using Sparse Profile Hidden Markov Models (Poster). ISMB 2006, Detroit, MI.
  6. T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher Kernel method to detect remote protein homologies. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1999, pp. 149-158.
  7. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, vol. 7, no. 1-2, 2000, pp. 95-114.
  8. C. S. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for svm protein classification. Pacific Symposium on Biocomputing, 2002, pp. 566-575.
  9. C. S. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for svm protein classification. NIPS, 2002, pp. 1417-1424.

Kernel Barcoding

Kernel Methods for Sequence Classification

DNA Barcoding

Genomic approaches to the classification of organisms exploit diversity among DNA sequences to identify organisms. The task of identification is to assign an unidentified organism to a taxonomic group at the target level: class, family, species, etc. Identification tasks are common in many application domains: applied taxonomy, general taxonomic systems, biosecurity, and others.

In the DNA barcoding setting, identification of organisms is performed using information from a standard gene region common across all taxa; in particular, the mitochondrial genome of animals has been identified as a target for analysis. A DNA barcode is a short fragment of DNA obtained from this standard region and used as a marker for identification and classification of species. Among the many advantages of DNA barcodes are standardization and the ability to streamline data exchange, continually add more species, and integrate data collected from different sources. The inherent problems that must be solved are the accuracy of identification and the scalability of the identification methods.

Our contributions:

• We show the efficiency of string kernels (mismatch/spectrum) as sequence similarity measures in the DNA barcoding setting 
• We present a framework for efficient computation of string kernels 
• We propose a divide-and-conquer technique for string kernels on DNA barcode data that provides a fast and scalable solution to many string kernel computation problems 
• We develop a divide-and-conquer algorithm for the (k,m)-mismatch kernel that improves the previously known running-time bound and is more memory efficient and easier to implement 
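For reference, the (k,m)-mismatch kernel of Leslie et al. can be written naively by explicitly expanding each k-mer's mismatch neighborhood; this is exactly the cost our divide-and-conquer algorithm avoids, but it makes the kernel's definition concrete. A toy Python sketch over the DNA alphabet (our own illustration, not the new algorithm):

```python
from itertools import product

def mismatch_neighborhood(kmer, m, alphabet="ACGT"):
    """All strings within Hamming distance m of kmer (naive expansion)."""
    neigh = {kmer}
    for _ in range(m):
        new = set()
        for s in neigh:
            for i, a in product(range(len(s)), alphabet):
                new.add(s[:i] + a + s[i + 1:])   # substitute one position
        neigh |= new
    return neigh

def mismatch_kernel(x, y, k=5, m=1, alphabet="ACGT"):
    """Naive (k,m)-mismatch kernel: each k-mer of a sequence contributes
    a count to every k-mer in its m-mismatch neighborhood, and the
    kernel is the dot product of the two count maps."""
    def counts(seq):
        c = {}
        for i in range(len(seq) - k + 1):
            for n in mismatch_neighborhood(seq[i:i + k], m, alphabet):
                c[n] = c.get(n, 0) + 1
        return c
    cx, cy = counts(x), counts(y)
    return sum(v * cy.get(km, 0) for km, v in cx.items())
```

With m=0 this reduces to the spectrum kernel; the neighborhood expansion is what makes the naive approach slow and memory-hungry for larger k and m.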

We evaluated our methods on three datasets: 

  • Fish sequences (7 species, 56 barcodes)
  • Astraptes dataset (12 species, 466 barcodes)
  • Hesperiidae dataset (371 species, 2135 barcodes)


Accuracy of identification

  • Figure 1: Astraptes data

  • Figure 2: Hesperiidae data



Comparison of running times (mismatch kernel computation, k=5, m=1)

Algorithm 1 is our new divide-and-conquer algorithm; Algorithm 2 is the original mismatch kernel algorithm. Measured on a 2.8 GHz CPU with 1 GB RAM, the new algorithm achieves an order-of-magnitude improvement in running time.


The use of kernel-based approaches for DNA barcoding results in a highly accurate and scalable species identification method. At the same time, we were able to identify signatures (small subsets of sequence features) for many species classes. The proposed algorithmic approach to string kernel computation achieves substantial running-time improvements in practice. Our results show that DNA barcoding can successfully identify specimens by examining a few sequence features, increasing the scalability and interpretability of current computational approaches to barcoding and species classification tasks.


  • Kernel methods for DNA barcoding
  • Rutgers Bioinformatics talk (Nov. 10, 2006)


  • [Jaakkola 00] Tommi Jaakkola, Mark Diekhans & David Haussler. A Discriminative Framework for Detecting Remote Protein Homologies. Journal of Computational Biology, vol. 7, no. 1-2, pages 95-114, 2000.
  • [Kuang 04] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund & Christina Leslie. Profile-Based String Kernels for Remote Homology Detection and Motif Extraction. In CSB ’04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB’04), pages 152-160, Washington, DC, USA, 2004. IEEE Computer Society.
  • [Leslie 02a] Christina S. Leslie, Eleazar Eskin & William Stafford Noble. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Pacific Symposium on Biocomputing, pages 566-575, 2002.
  • [Leslie 02b] Christina S. Leslie, Eleazar Eskin, Jason Weston & William Stafford Noble. Mismatch String Kernels for SVM Protein Classification. In NIPS, pages 1417-1424, 2002.
  • [P.D.N. 03] Hebert P.D.N., A. Cywinska, S.L. Ball & J.R. deWaard. Biological identifications through DNA barcodes. In Proceedings of the Royal Society of London, pages 313-322, 2003
  • [Steinke 05] Dirk Steinke, Miguel Vences, Walter Salzburger, and Axel Meyer. TaxI: a software tool for DNA barcoding using distance methods. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1462):1975-1980, 2005.
  • [Nielsen 06] Rasmus Nielsen and Mikhail Matz. Statistical approaches for DNA barcoding. Systematic Biology, 55(1):162-169, 2006.
  • [Hebert 04] Paul D. N. Hebert, Erin H. Penton, John M. Burns, Daniel H. Janzen, and Winnie Hallwachs. Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. In PNAS, volume 101, pages 14812-14817, 2004.
Software and Data

A Matlab implementation of our fast string kernel algorithms for DNA barcoding is available upon request. Please also see our Software page for more details. 

Barcode Datasets

  • The Hesperiidae dataset (hesperiidae.fasta) (see www.barcodinglife.org)
  • The Astraptes dataset (astraptes.fasta) (see [Hebert 04] and [Nielsen 06] for details on dataset)
  • The Fish dataset (fish.fasta) (see [Steinke 05] for detailed description of the dataset)