Fast Semi-Supervised Homology Prediction using Sparse Spatial Sample Kernel
Motivation:
Establishing structural and/or functional relationships between sequences, for instance to infer the structural class of an unannotated protein, is a key task in biological sequence analysis. Recent methods such as profile kernels and mismatch neighborhood kernels have shown promising results; however, their computational cost can be prohibitive in practice. In this study we propose a class of string-based kernels that are both biologically motivated and efficient to compute.
Results:
We present a new string kernel-based method, sparse spatial sample kernels (SSSK), for sequence classification tasks that offers state-of-the-art accuracy at low computational cost. Applied to remote homology detection, the proposed methods yield significantly better performance than existing state-of-the-art algorithms (profile and mismatch neighborhood kernels). Because of their low computational complexity, the proposed methods can work with very large databases of protein sequences and show substantial improvements in computing time over the existing methods. We also demonstrate the added value of spatial information and multi-resolution sampling for achieving state-of-the-art accuracy. The results have immediate practical value for accurate large-scale remote homology detection and classification of unannotated proteins.
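To make the idea of a spatial sample kernel concrete, here is a minimal sketch of a "double"-style kernel in Python. It is an illustration under stated assumptions, not the implementation used in the paper: it assumes a feature is a pair of k-mers together with the distance between their start positions, that the distance ranges from 1 to d inclusive, and that the kernel value is the inner product of the two feature-count vectors. The function names `double_features` and `double_kernel` are hypothetical.

```python
from collections import Counter


def double_features(seq, k=1, d=5):
    """Count (a1, gap, a2) features in seq.

    ASSUMPTION (not taken from the source): a1 and a2 are k-mers whose
    start positions differ by `gap`, with 1 <= gap <= d.
    """
    counts = Counter()
    n = len(seq)
    for i in range(n - k + 1):
        a1 = seq[i:i + k]
        for gap in range(1, d + 1):
            j = i + gap
            if j + k > n:  # second k-mer would run past the end
                break
            counts[(a1, gap, seq[j:j + k])] += 1
    return counts


def double_kernel(x, y, k=1, d=5):
    """Inner product of the feature-count vectors of x and y."""
    fx = double_features(x, k, d)
    fy = double_features(y, k, d)
    if len(fx) > len(fy):  # iterate over the smaller feature map
        fx, fy = fy, fx
    return sum(count * fy[feature] for feature, count in fx.items())
```

With k=1 and d=5 this corresponds, under the assumptions above, to the parameter setting written as double(1,5) in the file listings. Each sequence contributes at most d features per position, so building the feature map is linear in sequence length for fixed d.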
Supplementary Data
Datasets
- SCOP Benchmark (benchmark table, sequence file, labeled sequence headers). The dataset is obtained from here.
- PDB Dataset
- Swissprot Dataset
- NR Dataset
Semi-Supervised Experiments
ROC50 scores (plain text)
ROC50 scores (html)
- SCOP: triple(1,3) Kernel, double(1,5) Kernel, profile(5,7.5) Kernel
- Swiss-Prot: triple(1,3) Kernel, double(1,5) Kernel, profile(5,7.5) Kernel
- PDB: triple(1,3) Kernel, double(1,5) Kernel, profile(5,7.5) Kernel
- NR: triple(1,3) Kernel, double(1,5) Kernel, profile(5,7.5) Kernel
Supervised Experiments
ROC50 scores (plain text)
ROC50 scores (html)
triple(1,3) Kernel
double(1,5) Kernel
mismatch(5,1) Kernel
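For readers who want to recompute scores from their own rankings, the following is a minimal sketch of an ROC-n computation (ROC50 when n=50): the area under the ROC curve up to the first n false positives, normalized to the range 0 to 1. The function name `roc_n`, the handling of tied scores (input order), and the normalization when fewer than n negatives exist are assumptions for illustration and may differ from the scripts used to produce the score files listed here.

```python
def roc_n(scores, labels, n=50):
    """Normalized area under the ROC curve up to the first n false positives.

    scores: higher means more likely positive.
    labels: truthy for positives, falsy for negatives.
    ASSUMPTION: ties are kept in input order; if fewer than n negatives
    exist, the area is normalized by the number of negatives seen.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    if total_pos == 0:
        return 0.0

    true_pos = 0
    false_pos = 0
    area = 0
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == n:
                break

    if false_pos == 0:  # no negatives at all: ranking is trivially perfect
        return 1.0
    return area / (false_pos * total_pos)
```

A score of 1 means every positive is ranked above the first n negatives; a score of 0 means none is.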
Notes:
All experiments were performed using the SPIDER machine learning package.
The files are in bzip2 format.
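The bzip2 files can be read directly without decompressing them to disk first. The sketch below uses Python's standard `bz2` module; the helper name `read_bz2_lines` and the file name in the usage comment are hypothetical, and UTF-8 text encoding is an assumption about the file contents.

```python
import bz2


def read_bz2_lines(path):
    """Return the lines of a bzip2-compressed text file, newlines stripped.

    ASSUMPTION: the file holds UTF-8 (or plain ASCII) text.
    """
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Example usage with a hypothetical file name:
# lines = read_bz2_lines("scores.txt.bz2")
```

On the command line, `bunzip2 file.bz2` or `bzip2 -d file.bz2` produces the uncompressed file instead.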