Boosted Bayesian Network Classifiers

Discriminative Graphical Models

In Collaboration with Dr. James M. Rehg and Yushi Jing, College of Computing, Georgia Institute of Technology.

Discriminative learning, or learning for classification, is a common learning task that has been addressed in a number of different frameworks.  One may design a complex classifier, such as a support vector machine, that explicitly minimizes classification error. Alternatively, a set of weak learners can be trained using the boosting algorithm [Schapire97].  However, one may be explicitly interested in constructing a generative model for classification, such as a Bayesian network. The option in that case is to train this generative model discriminatively.  Unfortunately, discriminative training of generative models is computationally complex [Friedman97, Greiner02, Grossman04].  On the other hand, if the model is trained generatively, by maximum likelihood (ML), its strong reliance on correct structure and independence assumptions often undermines its classification ability.

In this project we study a new framework for discriminative training of generative models.  As in traditional boosting, we recursively learn a set of classifiers, this time constructed from generative models. Unlike boosting, where weak classifiers are trained discriminatively, the ‘weak classifiers’ in our method are trained generatively, to maximize the likelihood of the weighted data.  This approach has two distinct benefits.  First, our classifiers are constructed from generative models.  This is important in many practical cases where generative models, such as Bayesian networks or HMMs, are desired or appropriate (e.g., sequence modeling).  Second, ML training of generative models is computationally more efficient than discriminative training of the same models. Therefore, by discriminatively setting the weights on the data and by generatively training the intermediate models, we achieve a computationally efficient way of training generative classifiers.
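
To make the weighted-ML idea concrete, here is a minimal sketch (ours, not the paper's exact algorithm) of AdaBoost-style boosting in which each weak learner is a naive Bayes model, a simple Bayesian network, fit by weighted maximum likelihood. Binary features, labels in {-1, +1}, and all function names are illustrative assumptions.

```python
import numpy as np

def fit_weighted_nb(X, y, w):
    """Weighted-ML naive Bayes for binary features; returns per-class log parameters."""
    params = {}
    for c in (-1, 1):
        wc, Xc = w[y == c], X[y == c]
        # Weighted frequency of feature = 1 in class c, with Laplace smoothing.
        theta = (wc @ Xc + 1.0) / (wc.sum() + 2.0)
        prior = wc.sum() / w.sum()
        params[c] = (np.log(theta), np.log1p(-theta), np.log(prior))
    return params

def nb_predict(params, X):
    scores = {c: X @ lt + (1 - X) @ lnt + lp for c, (lt, lnt, lp) in params.items()}
    return np.where(scores[1] > scores[-1], 1, -1)

def boosted_nb(X, y, rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        params = fit_weighted_nb(X, y, w)        # generative step: weighted ML
        pred = nb_predict(params, X)
        err = w[pred != y].sum()
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)    # discriminative step: AdaBoost weight
        w *= np.exp(-alpha * y * pred)           # re-weight the data
        w /= w.sum()
        ensemble.append((alpha, params))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * nb_predict(p, X) for a, p in ensemble))
```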

Algorithms

Results


Publications

  1. Jing, Y., Pavlovic, V. & Rehg, J.M. (2008), “Boosted Bayesian network classifiers”, Machine Learning Journal.
  2. Jing, Y., Pavlovic, V. & Rehg, J.M. (2005) *Tech-report version (GIT-GVU-05-23)* (Includes a preliminary analysis of boosted Dynamic Bayesian Network Classifiers, as an alternative to discriminative training methods like Conditional Random Fields)
  3. Jing, Y., Pavlovic, V. & Rehg, J.M. (2005) Efficient discriminative learning of Bayesian network classifiers via Boosted Augmented Naive Bayes – Proceedings of International Conference on Machine Learning (ICML 2005), Distinguished Student Paper Award.
  4. Jing, Y., Pavlovic, V. & Rehg, J.M. (2005) Discriminative Learning Using Boosted Generative Models – The Learning Workshop at Snowbird, Utah, April 5-8.


Software and Data

We have developed a C++ library for structure and parameter learning in boosted augmented Bayesian networks.  Please see our Software page for more details.

Classifying Brain Signals

 

The brain mechanisms underlying the ability of humans to process faces have been studied extensively in the last two decades. Brain imaging techniques, particularly fMRI (functional Magnetic Resonance Imaging), which possesses high spatial resolution but limited temporal resolution, are advancing our knowledge of the spatial and anatomical organization of face-sensitive brain areas. At the other end are EEG recording techniques with high temporal resolution but poor spatial resolution. They reveal event-related potentials (ERPs) that serve as correlates for various aspects of facial processing. The best-established ERP marker of face perception is N170, a negative component that occurs roughly 170 ms after stimulus onset. Other markers, such as N250, N400 and P600, which occur later than N170, are considered to contribute to face recognition and identification. In a typical fMRI or ERP study, signals are recorded while the subject is exposed to a face or a non-face stimulus, to isolate the brain areas that respond differentially to faces (fMRI) and the temporal intervals that exhibit major differences between face and non-face responses in the data (ERP). Early studies in both fMRI and ERP employed simplistic signal processing techniques, involving averaging over multiple instances followed by a manual examination to detect differentiating components. More advanced statistical signal analysis techniques were first applied to fMRI signals, while systematic analyses generally lagged behind in the ERP domain. A recent principled approach is that of Moulson et al., in which the authors applied statistical classification in the form of Linear Discriminant Analysis (LDA) to face/non-face ERPs and obtained classification accuracies of about 70%.

There are two major goals for the current ERP study: First, to emphasize higher brain areas of visual processing at the expense of early visual areas by using a strategically designed control non-face stimulus. Second, and more importantly, this study seeks to apply systematic machine learning and pattern recognition techniques for classifying face and non-face responses in both the spatial and temporal domains. Toward the first major goal, we use a face and a non-face stimulus derived from the well-known vase/faces illusion.

The key feature of the face and non-face stimuli is that they share the same defining contours, differing only slightly in stimulus space. The defining contours are attributed to whatever forms the figure; as the percept alternates spontaneously, these contours are attributed alternately to the faces or to the vase.

This attribution is biased, using minimal image manipulations, toward the vase or toward the faces, but the same contours are used in both cases. These shared contours result in relatively similar responses for the faces and vase stimuli and produce a difficult classification task for the ensuing ERP signals. With this choice of face and non-face stimuli, we bias the analysis towards detecting signal differences that are elicited by the high-level percepts (face or non-face) rather than by low-level image differences.

With respect to the second major goal, we note that several types of classifiers have been used in previous EEG classification studies: kNN, logistic regression, multi-layer perceptrons, support vector machines and LDA. Prior work has also applied a group penalty to lasso regression in the frequency domain for EEG signal classification and shown the utility of grouping in that domain. The present study extends this systematic use of classifiers by using the ERP signals obtained with the faces/vase stimuli to test three major classification schemes: kNN, L1-norm logistic regression, and group lasso logistic regression. We perform the classification analysis between the two classes of face and vase responses agnostically, based on purely statistical estimates, without favoring any sensors or temporal intervals. Our goal is to use the data to point to the salient spatio-temporal ERP signatures most indicative of the stimulus classes. ERP classification also has important applications in neuroscience, such as brain-computer interfaces (BCI) and the detection of pathologies such as epilepsy.
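
As a rough sketch of one of these schemes, the following is a minimal group-lasso logistic regression trained by proximal gradient descent, grouping coefficients by electrode so that entire channels can be zeroed out. This is our illustrative code, not the study's implementation; the regularization strength, step size, and feature layout are placeholder assumptions.

```python
import numpy as np

def group_lasso_logreg(X, y, groups, lam=0.1, step=1e-3, iters=2000):
    """X: (trials, features); y in {0, 1}; groups: list of index arrays, one per channel."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # logistic predictions
        grad = X.T @ (p - y) / len(y)        # gradient of the logistic loss
        w -= step * grad
        for g in groups:                     # proximal step: group soft-thresholding
            norm = np.linalg.norm(w[g])
            w[g] = 0.0 if norm == 0 else max(0.0, 1 - step * lam / norm) * w[g]
    return w

# Hypothetical layout: 64 channels x 100 time points, flattened per trial:
# groups = [np.arange(c * 100, (c + 1) * 100) for c in range(64)]
```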

The main results of our tests with the three major classifier schemes are: First, kNN produced the worst performance, having to rely on all, potentially noisy, features. Second, the other two schemes were able to classify the signals with an overall accuracy of roughly 80%. Finally, the learned weights of L1-norm logistic regression and group lasso were able to locate the salient features in space (electrode position) and time in close agreement with the accepted wisdom of previous studies, confirming various markers of face perception such as N170, N400 and P600.

 

Applying sparse dictionary methods not only improved the classification performance but also pointed out the spatial and temporal regions of high interest. That is, the most discriminative features in time and space (channels) correspond to the ERPs and active regions of the brain that are responsible for distinguishing between a face and a non-face. The following figures illustrate them.

Mapping of L1 regression coefficients to channels (first figure) and time (second figure)
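
A hypothetical sketch of how such coefficient maps can be produced: reshape the learned weight vector back into a (channels x time) array and aggregate absolute coefficients along each axis. Shapes and names here are illustrative, not taken from the paper.

```python
import numpy as np

def coefficient_maps(w, n_channels, n_times):
    W = np.abs(w.reshape(n_channels, n_times))
    channel_saliency = W.sum(axis=1)   # contribution per electrode (spatial map)
    time_saliency = W.sum(axis=0)      # contribution per time point (temporal map)
    return channel_saliency, time_saliency
```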

 

Related Publications

  • [1] S. Shariat, V. Pavlovic, T. Papathomas, A. Braun and P. Sinha. “Sparse dictionary methods for EEG signal classification in face perception”. Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on. 2010. pp. 331 – 336.

 

Protein Classification

Protein Classification and Functional Prediction 
Problem 

Proteins are linear sequences of amino acids that perform essential functions in organisms. Determining the function of a protein requires expensive experimental procedures. With better sequencing techniques, it becomes evident that computational prediction/classification of an unknown protein’s function will help drive the labor and cost down.

Protein remote homology detection involves detecting functionally related proteins with very little sequence identity. The task is challenging for two major reasons: first, the lengths of the sequences vary, and second, remotely homologous proteins need not share great sequence identity; some share as little as 30%.

Motivation

Previous studies [1,2,3] show that the following features are sufficient to describe superfamilies in the Immunoglobulin-like beta-sandwich fold: 
• a set of critical positions, and 
• the preferred amino acid residues in these positions, and 
• the distance between each neighboring pair of critical positions. 
Our goal is to find computational methods, with good classification performance, that are able to recover these critical positions as well as the preferred residues. 

Our approaches
We attempt to recover these critical positions by employing different approaches. In our earlier work, we used a generative model; at present, we use a discriminative model. 

The generative approach
We employ a class of models known as duration-explicit hidden Markov models in an attempt to recover the critical positions. We call our models sparse profile hidden Markov models and show their structure in Fig. 1. The structure of our model resembles that of a profile hidden Markov model. The crucial difference, however, is that while in a profile hidden Markov model the number of symbols an insertion state (diamond shape) emits follows a geometric distribution, in our model it can be explicitly modeled by any arbitrary distribution. Moreover, since we only care about the critical positions, we use the background distribution for the emission probabilities of the insertion states, to express our indifference to the residue composition between critical positions. The match states attempt to capture the critical positions. 
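
The following toy sketch (ours, not the authors' implementation) contrasts the two duration models for an insertion state: the implicitly geometric run length induced by a self-loop versus an explicit, arbitrary length distribution, with background emissions in both cases. The flat background and function names are illustrative assumptions.

```python
import numpy as np

AMINO_ACIDS = 20
# Flat background distribution, purely for illustration.
log_bg = np.log(np.full(AMINO_ACIDS, 1.0 / AMINO_ACIDS))

def geometric_insert_loglik(n_symbols, p_stay, symbols):
    """Standard insert state: self-loop prob p_stay => geometric run length.
    symbols: integer-encoded residues (np.array)."""
    log_len = n_symbols * np.log(p_stay) + np.log(1 - p_stay)
    return log_len + log_bg[symbols].sum()

def explicit_insert_loglik(n_symbols, log_duration, symbols):
    """Duration-explicit insert state: log_duration[d] may be any distribution
    over run lengths; emissions are still background, reflecting indifference
    to residues between critical positions."""
    return log_duration[n_symbols] + log_bg[symbols].sum()
```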

Our approach of focusing on the critical positions and modeling the emission probabilities of the insertion states as background probabilities results in extremely sparse models. Our publication [5] indicates that approximately m/4 critical positions are sufficient to describe a superfamily with m positions. Figs. 2 and 3 show the performance, in terms of recall (red curve) and precision (blue curve), of the sparse profile HMM on two different superfamilies, Beta-Galactosidase and Cadherin, as a function of the number of critical positions (horizontal axis). It can be seen that the performance of the sparse profile HMM starts to degrade severely once the number of critical positions falls below 25%. 

Our main contribution to this project is that our results are consistent with previously established studies [1,2,3]: a set of critical positions, the preferred symbols at these positions, and the distance between each pair of neighboring positions are sufficient to describe the superfamilies in the Immunoglobulin-like beta-sandwich fold. Moreover, our models are extremely sparse, and the main advantage of a sparse model is that fewer parameters need to be estimated, so the model is less prone to overfitting. Our results indicate that a 1:4 compression ratio can be achieved.



The discriminative approach
Though in most problem domains a generative model performs quite well, it does come with some disadvantages. Denote by X the sequence (example) and by Y the label of the sequence. First, one needs to assume a parametric form for P(X), and when the parametric assumption is invalid, one risks injecting bias into the model. Second, building a generative model for a superfamily requires only positive sequences, i.e., sequences that are known to belong to this superfamily; no negative sequences participate in the process of estimation. As a result, if one picks the important features indicated by the resulting generative model, one cannot guarantee that these features do not also appear in the negative examples. 

Discriminative models do not suffer from the shortcomings listed above. First, no parametric form for P(X) is required. Second, a discriminative model focuses on building the decision boundary between the positive and negative classes (meaning both positive and negative examples participate in the estimation process) and can thus ensure that the important features appear in one class but not the other. 

However, the use of discriminative models for protein remote homology detection presents a big challenge: proteins are of variable length, but discriminative models require fixed-length features. To tackle the problem, Jaakkola et al. [6,7] proposed a hybrid model: a generative model, trained using positive sequences only, acts as a feature extractor, extracting features from positive and negative sequences, and the features are fed to a discriminative model to perform classification. The features are the gradients of the log-likelihood of the sequence with respect to the model parameters. Alternatively, Leslie et al. [8,9] proposed to represent each sequence by the frequencies of all k-long substrings in a |\Sigma|^k-dimensional spectrum, where \Sigma denotes the alphabet, and to use these high-dimensional features to perform classification. 
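
As a concrete illustration of the second representation, here is a minimal, dictionary-based sketch of the spectrum feature map and kernel of Leslie et al. [8]. It is our naive simplification for clarity, not their optimized implementation.

```python
from collections import Counter

def spectrum_features(seq, k):
    """Count all k-long substrings of a sequence (sparse spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    """Dot product of the two spectrum count vectors."""
    f1, f2 = spectrum_features(s1, k), spectrum_features(s2, k)
    return sum(c * f2[kmer] for kmer, c in f1.items())
```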

We follow the formalism proposed by Jaakkola et al. In our hybrid setting, we use a profile HMM, a generative model, as the feature extractor, and the discriminative model we use is logistic regression. The generative models are estimated from positive sequences only, and the features we propose are the sufficient statistics of the symbols in each match state. These features attempt to capture the critical positions and the corresponding preferred symbols. Finally, we place a Normal or Laplacian prior, centered at 0, on the parameters of the logistic regression model to enforce sparsity. 
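
A minimal sketch of the discriminative stage under this reading: MAP estimation with a Laplacian prior is equivalent to L1-penalized logistic regression, shown here with scikit-learn. The feature matrix of match-state sufficient statistics is assumed to be precomputed, and the variable names are ours.

```python
from sklearn.linear_model import LogisticRegression

# X: (n_sequences, n_features) match-state sufficient statistics; y: 0/1 labels.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)  # C ~ 1 / prior strength
# clf.fit(X, y)
# selected = clf.coef_.ravel() != 0   # the sparse set of (position, residue) features
```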

We compare our model and features with some state-of-the-art algorithms in Figs. 4 and 5. We use ROC-50 scores to evaluate the methods. The ROC-50 score is the normalized area under the ROC curve up to the first 50 false positives. This criterion has been reported to be more effective when comparing methods on highly imbalanced test sets, in which the examples of one class (in our case, the negative examples) far outnumber the examples of the other. In Fig. 4, the horizontal axis corresponds to the ROC-50 score and the vertical axis denotes the number of experiments, out of 54, achieving that ROC-50 score or better. The subfigure in the right panel is a more detailed view of the high ROC-50 region. The figures indicate that both logistic regression models (with Normal and Laplacian priors) achieve performance comparable to the mismatch(5,1) kernel. However, we note that, across the 54 families, there are 480 features on average, while the logistic regression model with Laplacian prior selects 43 features per family on average, a reduction of more than 90%. 
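
The ROC-50 score itself is straightforward to compute; below is a short sketch following the definition above (our code; the input conventions are illustrative).

```python
import numpy as np

def roc50(scores, labels, max_fp=50):
    """Normalized area under the ROC curve up to the first max_fp false positives.
    scores: classifier outputs; labels: 1 for positives, 0 for negatives."""
    order = np.argsort(-np.asarray(scores))       # rank by descending score
    n_pos = int(np.sum(labels))
    tp = fp = area = 0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
            area += tp                            # one unit-width FP column of height tp
            if fp == max_fp:
                break
    return area / (max_fp * n_pos) if n_pos else 0.0
```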

In Fig. 5, we show pairwise performance comparisons of the different methods. In the left panel, we compare the mismatch(5,1) kernel (vertical axis) with the logistic regression model with Normal prior. Points falling above the diagonal correspond to experiments in which the logistic model with Normal prior performs better than the mismatch(5,1) kernel. There are 30 and 22 points above and below the diagonal, respectively. The two-sided p-value of the sign test is 0.33, which indicates that the performance of the two models is comparable. In the right panel, we compare the effect of different priors on the logistic regression model: a Normal prior (horizontal axis) and a Laplacian prior (vertical axis). There are 25 and 26 points above and below the diagonal, respectively. The two-sided p-value of the sign test is approximately 1, suggesting that although the Laplacian prior selects only 10% of the features, this sparse set of features plays an important role in the classification task. 
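
The quoted p-values follow from a two-sided sign test, i.e., a binomial test of the win/loss counts against a fair coin. A quick check (our sketch, using SciPy) reproduces them:

```python
from scipy.stats import binomtest

# 30 wins vs 22 losses against the mismatch(5,1) kernel.
print(binomtest(30, 30 + 22, p=0.5).pvalue)   # ~0.33
# 25 wins vs 26 losses between Normal and Laplacian priors.
print(binomtest(25, 25 + 26, p=0.5).pvalue)   # ~1.0
```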

We show the recovered critical positions and the preferred symbols in Figs. 6 and 7. The model (table) shown in Fig. 6 corresponds to the scorpion toxin-like superfamily, and the tested family is Plant defensins. Out of 188 features, the logistic regression model with Laplacian prior picks 19 features around 12 positions. The ROC-50 score for this experiment is 1. We also show the corresponding HMM-logo obtained from PFAM. The height of each symbol at each position corresponds to its importance as suggested by the profile HMM, estimated from positive sequences only. The leading numbers in the table denote the positions in our HMM, while the numbers in parentheses denote the positions in the HMM-logo. The slight variation occurs because we use a slightly different heuristic to determine the structure of the profile HMM. We can see that the HMM-logo from PFAM suggests that 8 positions with Cysteine (C) residues play an important role in this superfamily. However, only 5 such positions (circled in red) are captured by our model. Upon investigation, it becomes clear that the other three positions with Cysteine residues are not selected because the same alignment has also been observed in the negative sequences. The model shown in Fig. 7 corresponds to the same superfamily, with a different tested family. The logistic regression model with Laplacian prior selects 14 out of 116 features, and the ROC-50 score of this experiment is 0.91. In this experiment, all 6 positions with Cysteine residues indicated by the HMM-logo are captured by our model, suggesting that this pattern is characteristic of and unique to this family. 

Our main contribution to this project is a set of features that are biologically intuitive and, when combined with the model we choose, yield interpretable results. The performance of our features and model is comparable to that of the state-of-the-art algorithms, SVM-Fisher and string kernels, but our features and model offer a biologically intuitive interpretation. With the use of a sparsity-enforcing prior such as the Laplacian, our models produce results consistent with [1,2,3], namely that a sparse set of positions and the preferred residues might be sufficient to describe a superfamily; more importantly, our model shows promising ability to recover such positions and preferred residues.



References

  1. A. E. Kister, A. V. Finkelstein and I. M. Gelfand. Common features in structures and sequences of sandwich-like proteins. PNAS, vol. 99, no. 22, pp. 14137-14141, 2002.
  2. B. Reva, A. Kister, S. Topiol, and I. Gelfand. Determining the roles of different chain fragments in recognition of immunoglobulin fold. Protein Eng., vol. 15, no. 1, pp. 13-19, 2002.
  3. A. E. Kister, M. A. Roytberg, C. Chothia, J. M. Vasiliev, and I. M. Gelfand. The sequence determinants of cadherin molecules. Protein Sci., vol. 10, no. 9, pp. 1801-1810, 2001.
  4. S. Eddy. Profile hidden Markov models. Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998.
  5. P. Huang and V. Pavlovic. Protein Homology Detection using Sparse Profile Hidden Markov Models (Poster). ISMB 2006, Detroit, MI.
  6. T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher Kernel method to detect remote protein homologies. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1999, pp. 149-158.
  7. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, vol. 7, no. 1-2, 2000, pp. 95-114.
  8. C. S. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. Pacific Symposium on Biocomputing, 2002, pp. 566-575.
  9. C. S. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. NIPS, 2002, pp. 1417-1424.

Kernel Barcoding

Kernel Methods for Sequence Classification

DNA Barcoding

Genomic approaches to classification of organisms exploit diversity among DNA sequences to identify organisms. The task of identification is to assign any unidentified organism to the taxonomic group at the target level: class, family, species, etc. Identification tasks are common in many application domains: applied taxonomy, general taxonomic systems, biosecurity, etc.

In the DNA barcoding setting, identification of organisms is performed using information from a standard gene region common across all taxa; in particular, the mitochondrial genome of animals has been identified as a target for analysis. A DNA barcode is a short fragment of DNA obtained from this standard region and used as a marker for identification and classification of species. Among the many advantages of DNA barcodes are standardization, the ability to streamline the exchange of data, and the ability to continually add more species and to integrate data collected from different sources. Inherent problems that need to be solved are the accuracy of identification and the scalability of the identification methods.

Our contributions:

• We show the efficiency of string kernels (mismatch/spectrum) as sequence similarity measures in the DNA barcoding setting 
• We present a framework for efficient computation of string kernels 
• We propose a divide-and-conquer technique for string kernels on DNA barcode data that provides a fast and scalable solution for many string kernel computation problems 
• We develop a divide-and-conquer algorithm for the (k,m)-mismatch kernel that improves on the previously known running-time bound, is more memory-efficient, and is easy to implement (a naive baseline sketch follows this list) 
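
For reference, here is a naive baseline (our sketch, quadratic in the number of k-mer pairs) that computes the (k,m)-mismatch kernel for m = 1 directly from its definition, K(s1, s2) = sum over k-mer pairs (a, b) of |N(a,1) ∩ N(b,1)|, where N(x,1) is the set of k-mers within one mismatch of x. For m = 1 the intersection size depends only on the Hamming distance between a and b (standard combinatorics); sigma = 4 reflects the DNA alphabet. The divide-and-conquer algorithm above computes the same kernel without this pairwise loop.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def mismatch_kernel_m1(s1, s2, k=5, sigma=4):
    # |N(a,1) & N(b,1)| as a function of Hamming distance d(a, b), for m = 1:
    # d = 0 -> 1 + k*(sigma-1); d = 1 -> sigma; d = 2 -> 2; d > 2 -> 0.
    weight = {0: 1 + k * (sigma - 1), 1: sigma, 2: 2}
    kmers1 = [s1[i:i + k] for i in range(len(s1) - k + 1)]
    kmers2 = [s2[i:i + k] for i in range(len(s2) - k + 1)]
    return sum(weight.get(hamming(a, b), 0) for a in kmers1 for b in kmers2)

# Example: mismatch_kernel_m1("ACGTACGTA", "ACGTTCGTA")
```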

Results
We evaluated our methods on three datasets: 

  • Fish sequences (7 species, 56 barcodes)
  • Astraptes dataset (12 species, 466 barcodes)
  • Hesperiidae dataset (371 species, 2135 barcodes)

 

Accuracy of identification

  • Figure 1: Astraptes data

  • Figure 2: Hesperiidae data

 

 

Comparison of running times (mismatch kernel computation, k=5, m=1)

Algorithm 1 = new algorithm, Algorithm 2 = old mismatch algorithm

Note: machine configuration: 2.8 GHz CPU, 1 GB RAM 
Note: order-of-magnitude improvement 

Conclusions

The use of kernel-based approaches for DNA barcoding resulted in a highly accurate and scalable species identification method. At the same time, we were able to identify signatures (small subsets of sequence features) for many species classes. The proposed algorithmic approach to string kernel computations achieves substantial running-time improvements in practice. Our results show that DNA barcoding can successfully identify specimens by examining a few sequence features, resulting in increased scalability and interpretability of current computational approaches to barcoding and species classification tasks. 

Papers

  • Kernel methods for DNA barcoding
  • Rutgers Bioinformatics talk (Nov. 10, 2006)

References

  • [Jaakkola 00] Tommi Jaakkola, Mark Diekhans & David Haussler. A Discriminative Framework for Detecting Remote Protein Homologies. Journal of Computational Biology, vol. 7, no. 1-2, pages 95-114, 2000.
  • [Kuang 04] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund & Christina Leslie. Profile-Based String Kernels for Remote Homology Detection and Motif Extraction. In CSB ’04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB’04), pages 152-160, Washington, DC, USA, 2004. IEEE Computer Society.
  • [Leslie 02a] Christina S. Leslie, Eleazar Eskin & William Stafford Noble. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Pacific Symposium on Biocomputing, pages 566-575, 2002.
  • [Leslie 02b] Christina S. Leslie, Eleazar Eskin, Jason Weston & William Stafford Noble. Mismatch String Kernels for SVM Protein Classification. In NIPS, pages 1417-1424, 2002.
  • [Hebert 03] Paul D. N. Hebert, A. Cywinska, S. L. Ball & J. R. deWaard. Biological identifications through DNA barcodes. In Proceedings of the Royal Society of London B, pages 313-322, 2003.
  • [Steinke 05] Dirk Steinke, Miguel Vences, Walter Salzburger & Axel Meyer. TaxI: a software tool for DNA barcoding using distance methods. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1462):1975-1980, 2005.
  • [Nielsen 06] Rasmus Nielsen and Mikhail Matz. Statistical approaches for DNA barcoding. Systematic Biology, 55(1):162-169, 2006.
  • [Hebert 04] Paul D. N. Hebert, Erin H. Penton, John M. Burns, Daniel H. Janzen & Winnie Hallwachs. Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. In PNAS, volume 101, pages 14812-14817, 2004.

Software and Data

Matlab implementation of our fast string kernel algorithms for DNA barcoding is available upon request.  Please also see our Software page for more details. 

Barcode Datasets

  • The Hesperiidae dataset (hesperiidae.fasta) (see www.barcodinglife.org)
  • The Astraptes dataset (astraptes.fasta) (see [Hebert 04] and [Nielsen 06] for details on dataset)
  • The Fish dataset (fish.fasta) (see [Steinke 05] for detailed description of the dataset)

Spatial Kernels

Fast Semi-Supervised Homology Prediction using Sparse Spatial Sample Kernel

Motivation:

Establishing structural and/or functional relationships between sequences, for instance to infer the structural class of an unannotated protein, is a key task in biological sequence analysis. Recent methods such as profile kernels and mismatch neighborhood kernels have shown promising results; however, the incurred computational cost can be prohibitive in practice. In this study we propose a class of string-based kernels that are both biologically motivated and efficient to compute.

Results:

We present a new string kernel-based method, sparse spatial sample kernels (SSSK), for sequence classification tasks that offers state-of-the-art accuracy at low computational cost. Application of the proposed methods to remote homology detection yields significantly better performance than existing state-of-the-art algorithms (profile and mismatch neighborhood kernels). The proposed methods can work with very large databases of protein sequences because of their low computational complexity, and show substantial improvements in computing time over existing methods. We also demonstrate the added value of spatial information and multi-resolution sampling for achieving state-of-the-art accuracy. The results have immediate practical value for accurate large-scale remote homology detection and classification of unannotated proteins.
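
A hedged sketch of a spatial sample feature map in the spirit of the triple(1,3) kernel listed below: features are triples of single residues together with their two inter-residue spacings, each spacing bounded by a small constant. This is our reading of the construction; the paper's parameter conventions may differ.

```python
from collections import Counter

def triple_features(seq, d=3):
    """Count features (a1, d1, a2, d2, a3): three residues with spacings 1..d."""
    feats = Counter()
    n = len(seq)
    for i in range(n):
        for d1 in range(1, d + 1):
            for d2 in range(1, d + 1):
                j, l = i + d1, i + d1 + d2
                if l < n:
                    feats[(seq[i], d1, seq[j], d2, seq[l])] += 1
    return feats

def sssk(s1, s2, d=3):
    """Kernel value: dot product of the sparse spatial sample feature vectors."""
    f1, f2 = triple_features(s1, d), triple_features(s2, d)
    return sum(c * f2[t] for t, c in f1.items())
```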

Supplementary Data

Datasets


Semi-Supervised Experiments

ROC50 scores (plain text)
ROC50 scores (html)


Supervised Experiments

ROC50 scores (plain text)
ROC50 scores (html)

triple(1,3) Kernel
double(1,5) Kernel
mismatch(5,1) Kernel

Notes:

All experiments are performed using the SPIDER machine learning package.
The files are in bzip2 format.