Kernel Methods for Sequence Classification
Genomic approaches to classification of organisms exploit diversity among DNA sequences to identify organisms. The task of identification is to assign any unidentified organism to the taxonomic group at the target level: class, family, species, etc. Identification tasks are common in many application domains: applied taxonomy, general taxonomic systems, biosecurity, etc.
In the DNA barcoding setting, identification of organisms is performed using information from a standard gene region common across all taxa: in particular, mitochondrial genome of animals has been identified as a target for analysis. DNA barcode is a short fragment of DNA obtained from the standard region and is used as a marker for identification and classification of the species. Among many advantages of DNA barcodes are: standardization, ability to streamline exchange of data, to continually add in more species and integrate data collected from different sources. Inherent problems that need to be solved are accuracy of identification and scalability of the methods used for identification.
• We show efficiency of string kernels (mismatch/spectrum) as a sequence similarity measures in the DNA barcoding setting
• We present a framework for efficient computation of string kernels
• We propose divide-and-conquer technique for string kernels on DNA barcode data that provides a fast and scalable solution for many string kernel computation problems
• We develop a divide-and-conquer algorithm for the (k;m)-mismatch kernel that improves previously known bound on running time and is more memory efficient and easy to implement
We evaluated our methods on three datasets:
- Fish sequences (7 species, 56 barcodes)
- Astraptes dataset (12 species, 466 barcodes)
- Hesperiidae dataset (371 species, 2135 barcodes)
Accuracy of identification
- Figure 1: Astraptes data
- Figure 2: Hesperiidae data
Comparison of running times (mismatch kernel computation, k=5, m=1)
Algorithm 1 = new algorithm, algorithm 2 = old mismatch algorithm
Note: machine configuration: 2.8Ghz CPU, 1GB RAM
Note: order of magnitude improvement
The use of kernel-based approaches for DNA barcoding resulted in a highly accurate and scalable species identification method. At the same time, we were able to identify signatures (small subsets of sequence features) for many species classes. Proposed algorithmic approach to string kernel computations achieves substantial running time improvements in practice. Our results show that DNA barcoding can be successfully used to identify specimens by examining few sequence features, resulting in increased scalability and interpretability of current computational approaches to barcoding and species classification tasks.
- Kernel methods for DNA barcoding
- Rutgers Bioinformatics talk (Nov. 10, 2006)
- [Jaakkola 00] Tommi Jaakkola, Mark Diekhans & David Haussler. A Discriminative Framework for Detecting Remote Protein Homologies. Journal of Computational Biology, vol. 7, no. 1-2, pages 95-114, 2000.
- [Kuang 04] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund & Christina Leslie. Profile-Based String Kernels for Remote Homology Detection and Motif Extraction. In CSB ’04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB’04), pages 152-160, Washington, DC, USA, 2004. IEEE Computer Society.
- [Leslie 02a] Christina S. Leslie, Eleazar Eskin & William Stafford Noble. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Pacific Symposium on Biocomputing, pages 566-575, 2002.
- [Leslie 02b] Christina S. Leslie, Eleazar Eskin, Jason Weston & William Stafford Noble. Mismatch String Kernels for SVM Protein Classification. In NIPS, pages 1417-1424, 2002.
- [P.D.N. 03] Hebert P.D.N., A. Cywinska, S.L. Ball & J.R. deWaard. Biological identifications through DNA barcodes. In Proceedings of the Royal Society of London, pages 313-322, 2003
- [Steinke 05] Dirk Steinke, Miguel Vences, Walter Salzburger, and Axel Meyer. Taxi: a software tool for dna barcoding using distance methods. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1462):1975-1980, 2005.
- [Nielsen 06] Rasmus Nielsen and Mikhail Matz. Statistical approaches for DNA barcoding. Systematic Biology, 55(1):162-169, 2006.
- [Hebert 04] Paul D. N. Hebert, Erin H. Penton, John M. Burns, Daniel H. Janzen, and Winnie Hallwachs. Ten species in one: Dna barcoding reveals cryptic species in the neotropical skipper buttery astraptes fulgerator. In PNAS, volume 101, pages 14812-14817, 2004.
- Software and Data
Matlab implementation of our fast string kernel algorithms for DNA barcoding is available upon request. Please also see our Software page for more details.