Abstract: Various sequence-similarity kernels, known as string kernels, have been introduced for use with support vector machines (SVMs) in a discriminative approach to sequence data classification problems. In these applications, string kernels are required to serve as similarity measures between strings. In this paper, we present a new string kernel and its variants, suitable for sequence data classification, which are determined by the (possibly non-contiguous) matching subsequences of all possible lengths shared by two strings. In these kernels, gaps in subsequences are allowed, and longer subsequences contribute more to the kernel value. Efficient algorithms for computing the kernels are derived using dynamic programming and bit-parallelism. In some cases, the kernel can be computed in time linear in the length of the strings.
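
To make the idea of a kernel built from shared, possibly non-contiguous subsequences more concrete, the following is a minimal sketch of a generic weighted all-subsequences kernel computed by dynamic programming. It is not the kernel defined in this paper: the function names, the `weight` parameter, and the weighting scheme (each matched pair of subsequences of length l contributes `weight ** l`, so that a weight above 1 lets longer matches count more) are assumptions chosen for illustration, and the sketch runs in quadratic rather than linear time and does not use bit-parallelism.

```python
import math


def weighted_subsequence_kernel(s, t, weight=2.0):
    """Illustrative stand-in kernel (an assumption, not the paper's definition).

    Sums weight ** len(u) over every pair of index tuples that select the
    same (possibly non-contiguous) subsequence u from s and from t.
    The empty subsequence is counted once with weight 1.
    """
    n, m = len(s), len(t)
    # d[i][j] holds the kernel value for the prefixes s[:i] and t[:j].
    # Row 0 and column 0 stay at 1.0: only the empty subsequence is shared.
    d = [[1.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Inclusion-exclusion over matches that skip s[i-1] or skip t[j-1].
            d[i][j] = d[i - 1][j] + d[i][j - 1] - d[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                # Matches that end by pairing s[i-1] with t[j-1] extend any
                # match of the shorter prefixes by one character.
                d[i][j] += weight * d[i - 1][j - 1]
    return d[n][m]


def normalized_kernel(s, t, weight=2.0):
    """Cosine-style normalization so that every string has self-similarity 1."""
    k_st = weighted_subsequence_kernel(s, t, weight)
    k_ss = weighted_subsequence_kernel(s, s, weight)
    k_tt = weighted_subsequence_kernel(t, t, weight)
    return k_st / math.sqrt(k_ss * k_tt)
```

For example, with `weight=1.0` the call `weighted_subsequence_kernel("ab", "ab", weight=1.0)` counts the four shared subsequences (the empty one, "a", "b", and "ab"). The normalized form is one common way to keep long strings from dominating when such values are passed to an SVM as a precomputed kernel matrix.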

Length-weighted string kernels for sequence data classification