Protein Classification

Protein Classification and Functional Prediction 

Proteins are linear seqeuences of amino acids that perform essential functions in organisms. To determine the function of a protein, expensive experimental procedures are required. With better sequencing techniques, it becomes evident that computational prediction/classfication of an unknown protein’s function will help drive the labor and cost down. . 

Protein remote homology detection involves detecting functionally related proteins with very little sequence identity. The task is challanging due to two major reasons: first, the length of the sequences vary and second, remotely homologous proteins need not share great sequence identity. Some share sequence identity as little as 30%. 


Previous studies [1,2,3] show that the following features are sufficient to describe superfamilies in the Immunoglobulin-like beta-sandwich fold: 
• a set of critical positions, and 
• the preferred amino acid residues in these positions, and 
• the distance between each neighboring pair of critical positions. 
Our goal is to find computational methods, with good classification performance, that are able to recover these critical positions as well as the preferred residues. 

Our approaches
We attempt to recover these critical positions by employing different approaches. In our earlier work, we used a generative model; At present, we use a discriminative model. 

The generative approach
We employ a class of models, known as duration-explicit hidden Markov models, in an attempt to recover the critical positions. We call our models sparse profile hidden markov models and show the structure in Fig. 1. The structure of our model resembles that of a profile hidden Markov model. However, the crucial difference between our model and the profile hidden Markov model is that, while the number of symbols an insertion state (diamond shape) emits is represented by a geometric distribution, in our model, the number of symbols an insertion state emits can be explicitly modeled by any arbitrary distribution. Moreover, since we only care about the critical positions, we use the background distribution to represent the emission probabilities in the insertion states in order to express our indifference in the residue composition between critical positions. Our match states attempt to capture the critical positions. 

Our approach of focusing on the critical positions and modeling the emission probabilities of the insertion states as background probabilities results in extremely sparse models. Our publication [5] indicates that approximately m/4 critical positions are sufficient to describe a superfamily with m positions. Fig. 2 and 3 show the performance, in terms of recall (red curve) and precision (blue curves) of the sparse profile HMM on two different superfamilies: Beta-Galactosidase and Cadherin, as a function of number of critical positions (horizontal axis). It can be seen that the performance of the sparse profile HMM starts showing severe degradation once the number of critical positions falls lower than 25%. 

Our main contribution to this project is that our results are consistent with previous established studies [1,2,3], that a set of critical positions, the preferred symbols on these positions and the distance between each neighboring positions are sufficient to describe the superfamilies in the Immunoglobulin-like beta-sandwich fold. Moreover, our models are extremely sparse and the main advantage of a sparse model is that fewer parameters need to be estimated and thus the model is less prone to overfitting. Our results indicate that a 1:4 compression ratio can be achieved.

The discriminative approach
Though in most problem domains a generative model performs quite well, it does come with some disadvantages. Denote X as the sequence (example) and Y as the label of the sequence. First, one needs to assume a parametric form P(X) and when the parametric assumption is invalid, one risks injecting bias into the model. Second, building a generative model for a superfamily requires only positive sequences, sequences that are known to belong to this superfamily; no negative sequences participated in the process of estimation. As a result, if one picks the important features indicated by the built generative model, one cannot guarantee that these features do not appear in the negative examples. 

Discriminative models do not suffer from the shortcomings listed above. First, no parametric form of P(X) is required. Second, it focuses on building the decision boundary between the positive and negative classes (meaing both positive and negative examples participate in the estimation process) and thus it guarantees that the important features only appear in one class but not the other. 

However, use of discriminative models for protein remote homology detection presents a big challange: proteins are of variable length but the discriminative model requires fixed-length features. To tackle the problem, Jaakkola et al. [6,7] proposed to use a hyprid model: a generative model, trained using positive sequences only acts as a feature extractor, extracting features from positive and negative sequences and the features are fed to to a discriminative model to perform classification. The features are the gradient of the log likelihood of the sequence with respect to the model parameters. On the other hand, Leslie et al. [8,9] proposed to represent each sequence by the frequency of all k-long substrings in a |\Sigma|^k-dimensional spectrum, where \Sigma represents the alphabet set, and use these high dimensional features to perform classification. 

We follow the same formalism proposed by Jaakkola et al. In our hybrid setting, we use a profile HMM, a generative model, as the feature extractor and the discriminative model we use is the logistic regression model. The generative models are estimated from positive sequences only and the features that we propose to use are the sufficient statistics of the symbols in each match states. These features we propose attempt to capture the critical positions and the corresponding preferred symbols. Finally, we place a Normal or Laplacian prior centered at 0, on the parameter of the logistic regression model to enforce sparsity. 

We compare our model and features with some state-of-the-art algorithms in Fig. 4 and 5. We use the ROC-50 scores to evaluate the methods. The ROC-50 score is the normalized area under the ROC curve up to the first 50 false positives. Such criteria has been reported to be more effective when comparing methods on highly imbalanced test sets, in which examples in one class, in our case, the negative examples, are far greater than the examples in the other class. In Fig. 4, the horizontal axis corresponds to the ROC-50 score and the vertical access denotes the number of experiments, out of 54, achieving such or better ROC-50 score. The subfigure on the right panel is a more detailed view in the area of high ROC50 score. The figures indicate that both logistic regression models (with Normal and Laplacian priors) achieve comparable performance with the mismatch(5,1) kernel. However, we noted that, out of 54 families, there are 480 features on average. The logistic regression model with Laplacian prior selects 43 features per family on average, which shows more than 90% reduction of features. 

In Fig. 5, we show the pairwise performance comparison on different methods. In the left panel, we compare the mismatch(5,1) kernel (vertical axis) with the logistic regression model with Normal prior. Points falling above the diagonal correspond to an experiment in which the logistic model with Normal prior performs better than the mismatch(5,1) kernel. There are 30 and 22 points above and below the diagonal, respectively. The two-sided p-value of the sign test is 0.33, which indicates that the performance of the two models are comparable. In the left panel, we compare the effect of different priors on the logistic regression model. The priors are Normal prior (horizontal axis) and Laplacian prior (vertical axis). There are 25 and 26 points above and below the diagonal, respectively. The two-sided p-value of the sign test is approximately 1, suggesting that although the Laplacian prior only selects 10% of the features, this sparse set of features play important role in the classification task. 

We show the recovered critical positions and the preferred symbols in Fig. 6 and 7. The model (table) shown in Fig. 6 corresponds to the scorpion toxin-like superfamily and the tested family is Plant defensins. Out of 188 features, the logistic regression model with Laplacian prior picks 19 features around 12 positions. The ROC50 score for this experiment is 1. We also show the corresponding HMM-logo obtained from PFAM. The height of each symbol at each position corresponds to its importance suggested by the profile HMM, estimated from positive sequences only. The leading numbers in the table denotes the positions in our hmm while the number in parentheses denotes the positions in the HMM-logo. The slight variation occurs because we use a slightly different huristic to determine the structure of the profile HMM. We can see that the HMM-logo from PFAM suggests that 8 positions with Cysteine (C) residues play important role in this superfamily. However, only 5 such positions (circled in red) are captured by our model. Upon investigation, it becomes clear that the reason that the other three positions with Cysteine residues are not selected because such alignment has also been observed in the negative sequences. The model shown in Fig. 7 corresponds to the same superfamily, with a different tested family. The logistic regression model with Laplacian prior selects 14 out of 116 features and the ROC50 score of this experiment is .91. In this experiment, all 6 positions with Cysteine residues indicated by the HMM-logo are captured by our model, suggesting that such pattern is characteristic and unique to this family. 

Our main contribution to this project is that we present a set of features that are biologically intuitive and, when combined with the model we choose, yields interpretable results. The performance of our feature and model is comparable to the state-of-the-art algorithms, SVM-Fisher and string kernels, but our features and model offers biologically intuitive interpretation. With the use of a sparsity-enforcing prior such as Laplace, our models produces consistent results suggested in [1,2,3], that a sparse set of positions and the preferred residues might be sufficient to describe a superfamily, and more importantly, our model shows promising results to recover such positions and preferred residues.


  1. A. E. Kister, A. V. Finkelstein and I. M. Gelfand. Common features in structures and sequences of sandwich-like proteins. PNAS, vol. 99, no. 22, pp. 14137-14141,2002.
  2. B. Reva, A. Kister, S. Topiol, and I. Gelfand. Determining the roles of different chain fragments in recognition of immunoglobulin fold. Protein Eng., vol. 15, no. 1, pp. 13-19, 2002.
  3. A. E. Kister, M. A. Roytberg, C. Chothia, J. M. Vasiliev, and I. M. Gelfand. The sequence determinants of cadherin molecules. Protein Sci., vol. 10, no. 9, pp. 1801-1810, 2001.
  4. S. Eddy. Profile hidden Markov models. Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998.
  5. P. Huang and V. Pavlovic. Protein Homology Detection using Sparse Profile Hidden Markov Models (Poster). ISMB 2006, Detroit, MI.
  6. T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher Kernel method to detect remote protein homologies. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1999, pp. 149-158.
  7. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, vol. 7, no. 1-2, 2000, pp. 95-114.
  8. C. S. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for svm protein classification. Pacific Symposium on Biocomputing, 2002, pp. 566-575.
  9. C. S. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for svm protein classification. NIPS, 2002, pp. 1417-1424.

Kernel Barcoding

Kernel Methods for Sequence Classification

DNA Barcoding

Genomic approaches to classification of organisms exploit diversity among DNA sequences to identify organisms. The task of identification is to assign any unidentified organism to the taxonomic group at the target level: class, family, species, etc. Identification tasks are common in many application domains: applied taxonomy, general taxonomic systems, biosecurity, etc.

In the DNA barcoding setting, identification of organisms is performed using information from a standard gene region common across all taxa: in particular, mitochondrial genome of animals has been identified as a target for analysis. DNA barcode is a short fragment of DNA obtained from the standard region and is used as a marker for identification and classification of the species. Among many advantages of DNA barcodes are: standardization, ability to streamline exchange of data, to continually add in more species and integrate data collected from different sources. Inherent problems that need to be solved are accuracy of identification and scalability of the methods used for identification.

Our contributions:

• We show efficiency of string kernels (mismatch/spectrum) as a sequence similarity measures in the DNA barcoding setting 
• We present a framework for efficient computation of string kernels 
• We propose divide-and-conquer technique for string kernels on DNA barcode data that provides a fast and scalable solution for many string kernel computation problems 
• We develop a divide-and-conquer algorithm for the (k;m)-mismatch kernel that improves previously known bound on running time and is more memory efficient and easy to implement 

We evaluated our methods on three datasets: 

  • Fish sequences (7 species, 56 barcodes)
  • Astraptes dataset (12 species, 466 barcodes)
  • Hesperiidae dataset (371 species, 2135 barcodes)


Accuracy of identification

  • Figure 1: Astraptes data

  • Figure 2: Hesperiidae data



Comparison of running times (mismatch kernel computation, k=5, m=1)

Algorithm 1 = new algorithm, algorithm 2 = old mismatch algorithm

Note: machine configuration: 2.8Ghz CPU, 1GB RAM 
Note: order of magnitude improvement 


The use of kernel-based approaches for DNA barcoding resulted in a highly accurate and scalable species identification method. At the same time, we were able to identify signatures (small subsets of sequence features) for many species classes. Proposed algorithmic approach to string kernel computations achieves substantial running time improvements in practice. Our results show that DNA barcoding can be successfully used to identify specimens by examining few sequence features, resulting in increased scalability and interpretability of current computational approaches to barcoding and species classification tasks. 


  • Kernel methods for DNA barcoding
  • Rutgers Bioinformatics talk (Nov. 10, 2006)


  • [Jaakkola 00] Tommi Jaakkola, Mark Diekhans & David Haussler. A Discriminative Framework for Detecting Remote Protein Homologies. Journal of Computational Biology, vol. 7, no. 1-2, pages 95-114, 2000.
  • [Kuang 04] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund & Christina Leslie. Profile-Based String Kernels for Remote Homology Detection and Motif Extraction. In CSB ’04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB’04), pages 152-160, Washington, DC, USA, 2004. IEEE Computer Society.
  • [Leslie 02a] Christina S. Leslie, Eleazar Eskin & William Stafford Noble. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Pacific Symposium on Biocomputing, pages 566-575, 2002.
  • [Leslie 02b] Christina S. Leslie, Eleazar Eskin, Jason Weston & William Stafford Noble. Mismatch String Kernels for SVM Protein Classification. In NIPS, pages 1417-1424, 2002.
  • [P.D.N. 03] Hebert P.D.N., A. Cywinska, S.L. Ball & J.R. deWaard. Biological identifications through DNA barcodes. In Proceedings of the Royal Society of London, pages 313-322, 2003
  • [Steinke 05] Dirk Steinke, Miguel Vences, Walter Salzburger, and Axel Meyer. Taxi: a software tool for dna barcoding using distance methods. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1462):1975-1980, 2005.
  • [Nielsen 06] Rasmus Nielsen and Mikhail Matz. Statistical approaches for DNA barcoding. Systematic Biology, 55(1):162-169, 2006.
  • [Hebert 04] Paul D. N. Hebert, Erin H. Penton, John M. Burns, Daniel H. Janzen, and Winnie Hallwachs. Ten species in one: Dna barcoding reveals cryptic species in the neotropical skipper buttery astraptes fulgerator. In PNAS, volume 101, pages 14812-14817, 2004.
  • Software and Data

Matlab implementation of our fast string kernel algorithms for DNA barcoding is available upon request.  Please also see our Software page for more details. 

Barcode Datasets

  • The Hesperiidae dataset (hesperiidae.fasta) (see
  • The Astraptes dataset (astraptes.fasta) (see [Hebert 04] and [Nielsen 06] for details on dataset)
  • The Fish dataset (fish.fasta) (see [Steinke 05] for detailed description of the dataset)



    Spatial Kernels

    Fast Semi-Supervised Homology Prediction using Sparse Spatial Sample Kernel


    Establishing structural and/or functional relationship between sequences, for instance, to infer the structural class of an unannotated protein, is a key task in a biological sequence analysis. Recent methods such as profile kernels and mismatch neighborhood kernels have shown promising results; however, the incurred computational cost can be prohibitive in practice. In this study we propose a class of string-based kernels that are both biologically motivated and efficient to compute. 


    We present a new string kernel-based method, sparse spatial sample kernels (SSSK), for sequence classification tasks that offers the state-of-the-art accuracy and low computational cost. Application of the proposed methods to a remote homology detection yields significantly better performance than existing state-of-the-art algorithms (profile and mismatch neighborhood kernels.) The proposed methods can work with very large databases of protein sequences because of the low computational complexity and show substantial improvements in computing time over the existing methods. We also demonstrate the added value of the spatial information and multi-resolution sampling for achieving the state-of-the-art accuracy. The results have immediate practical value for accurate large-scale remote homology detection and classification of unannotated proteins.

    Supplementary Data


    Semi-Supervised Experiments

    ROC50 scores (plain text)
    ROC50 scores (html)

    Supervised Experiments

    ROC50 scores (plain text)
    ROC50 scores (html)

    triple(1,3) Kernel
    double(1,5) Kernel
    mismatch(5,1) Kernel


    All experiments are performed using SPIDER machine learning package. 
    The files are in bzip2 format.