Spatial Kernels

Fast Semi-Supervised Homology Prediction using Sparse Spatial Sample Kernel

Motivation:

Establishing structural and/or functional relationships between sequences, for instance to infer the structural class of an unannotated protein, is a key task in biological sequence analysis. Recent methods such as profile kernels and mismatch neighborhood kernels have shown promising results; however, their computational cost can be prohibitive in practice. In this study we propose a class of string-based kernels that are both biologically motivated and efficient to compute.

Results:

We present a new string kernel-based method, sparse spatial sample kernels (SSSK), for sequence classification tasks that offers state-of-the-art accuracy at low computational cost. Applied to remote homology detection, the proposed methods perform significantly better than existing state-of-the-art algorithms (profile and mismatch neighborhood kernels). Because of their low computational complexity, the proposed methods can work with very large databases of protein sequences and show substantial improvements in computing time over existing methods. We also demonstrate the added value of spatial information and multi-resolution sampling for achieving state-of-the-art accuracy. The results have immediate practical value for accurate large-scale remote homology detection and classification of unannotated proteins.

Supplementary Data

Datasets


Semi-Supervised Experiments

ROC50 scores (plain text)
ROC50 scores (html)
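
For readers unfamiliar with the metric, ROC50 is commonly defined as the area under the ROC curve up to the first 50 false positives, scaled to lie between 0 and 1. The sketch below is one minimal way to compute it and is not taken from the experiments' own code. The function name `roc_n`, the handling of tied scores (they are left in input order), and the normalization when fewer than `n` negatives exist are all assumptions made for illustration.

```python
def roc_n(scores, labels, n=50):
    """Area under the ROC curve up to the first n false positives, scaled to [0, 1].

    scores: prediction scores, higher means more likely positive.
    labels: truthy for positives, falsy for negatives.
    """
    # Rank examples from highest to lowest score (ties keep input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, y in ranked if y)
    if total_pos == 0:
        return 0.0

    tp = 0    # true positives seen so far
    fp = 0    # false positives seen so far
    area = 0  # sum over false positives of the true positives ranked above them
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp
            if fp == n:
                break

    if fp == 0:
        # Assumption: with no negatives at all, treat the ranking as perfect.
        return 1.0
    # Assumption: normalize by the number of false positives actually counted,
    # so that a perfect ranking scores 1.0 even with fewer than n negatives.
    return area / (total_pos * fp)
```

With `n` larger than the number of negatives, this reduces to the ordinary ROC area for tie-free scores.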


Supervised Experiments

ROC50 scores (plain text)
ROC50 scores (html)

triple(1,3) Kernel
double(1,5) Kernel
mismatch(5,1) Kernel
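
To give a rough sense of what a spatial sample kernel such as double(1,5) measures, the sketch below counts pairs of short substrings that occur at a bounded separation and takes the inner product of those counts for two sequences. It is an illustration under stated assumptions, not the implementation used in the experiments: the names `double_features` and `double_kernel` are hypothetical, and the convention that the offset between start positions ranges over 1..d is an assumption; the published definition may index distances differently.

```python
from collections import Counter


def double_features(seq, k=1, d=5):
    """Count features (a1, offset, a2): two k-mers whose start positions differ by offset.

    Assumption: offset ranges over 1..d inclusive.
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        a1 = seq[i:i + k]
        for offset in range(1, d + 1):
            j = i + offset
            if j + k > len(seq):
                break  # the second sample would run past the end of the sequence
            counts[(a1, offset, seq[j:j + k])] += 1
    return counts


def double_kernel(x, y, k=1, d=5):
    """Inner product of the two sequences' feature-count vectors."""
    fx = double_features(x, k, d)
    fy = double_features(y, k, d)
    if len(fx) > len(fy):
        fx, fy = fy, fx  # iterate over the smaller feature set
    # Counter returns 0 for features absent from fy.
    return sum(count * fy[feature] for feature, count in fx.items())
```

For example, `double_kernel("ABA", "ABA")` counts the three shared features `("A", 1, "B")`, `("A", 2, "A")` and `("B", 1, "A")`.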

Notes:

All experiments were performed using the SPIDER machine learning package.
The files are in bzip2 format.
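
The bzip2 files can be read without decompressing them to disk first. The following is a minimal sketch using Python's standard `bz2` module; the helper name `read_bz2_lines` and the assumption that the files are UTF-8 (or plain ASCII) text are illustrative and not stated on this page.

```python
import bz2


def read_bz2_lines(path):
    """Return the lines of a bzip2-compressed text file, without trailing newlines."""
    # "rt" opens the compressed stream in text mode.
    # Assumption: the file contents are UTF-8 (or ASCII) text.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```

On the command line, `bunzip2 <file>` or `bzip2 -d <file>` decompresses a file in place.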