Kernel matrix regression

Yoshihiro Yamanishi, Jean-Philippe Vert
DOI: https://doi.org/10.48550/arXiv.q-bio/0702054
2007-02-26
Abstract: We address the problem of filling missing entries in a kernel Gram matrix, given a related full Gram matrix. We attack this problem from the viewpoint of regression, assuming that the two kernel matrices can be considered as explanatory variables and response variables, respectively. We propose a variant of the regression model based on the underlying features in the reproducing kernel Hilbert space by modifying the idea of kernel canonical correlation analysis, and we estimate the missing entries by fitting this model to the existing samples. We obtain promising experimental results on gene network inference and protein 3D structure prediction from genomic datasets. We also discuss the relationship with the em-algorithm based on information geometry.
Quantitative Methods, Statistics Theory
What problem does this paper attempt to address?
The problem this paper addresses is **how to fill in the missing entries of a kernel Gram matrix**, given a related complete kernel Gram matrix. Specifically, the authors consider two datasets that describe the same set of objects. The first dataset is complete, so a complete kernel Gram matrix can be built from it. Only part of the second dataset is available, so its kernel Gram matrix has missing entries. When the second dataset is the one of primary interest, it is natural to use the information in the first dataset to estimate the missing kernel matrix entries of the second.

### Problem Background

This kind of problem is common in fields such as bioinformatics. For example:

- **Protein Structure Prediction**: DNA sequences are easily obtainable, but the 3D structures of most proteins are still unknown and difficult to determine.
- **Gene Network Inference**: Genome-wide data (such as gene expression data) are easily obtainable, but metabolic network information is known for only a limited number of genes.

### Solution

To solve this problem, the authors propose the **Kernel Matrix Regression (KMR)** model, which approaches the task from the perspective of regression and is based on the underlying features in the Reproducing Kernel Hilbert Space (RKHS). The model modifies the idea of Kernel Canonical Correlation Analysis (kCCA): the complete kernel matrix plays the role of the explanatory variables, the incomplete kernel matrix plays the role of the response variables, and the missing entries are estimated by fitting the model to the existing samples. The authors also discuss the relationship between KMR and the em-algorithm based on information geometry.

### Experimental Results

The authors conducted experiments on two tasks, gene network inference and protein 3D structure prediction. The results show that:

- KMR and its regularized version (Penalized KMR, PKMR) are competitive in performance with other methods.
- When the regularization parameter is appropriately selected, PKMR performs the best.

In conclusion, this paper proposes a new method for filling in the missing entries of a kernel matrix and supports its effectiveness with experiments on these two tasks.
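
The regression view of kernel completion described above can be illustrated with a small NumPy sketch. It is not the authors' exact estimator, which this summary does not spell out. It assumes a bilinear model in which the known block of the response kernel is approximated as `Kx_tt @ A @ Kx_tt`, with a ridge term `lam` standing in for the penalization. The function name `complete_kernel`, the parameter `lam`, and the convention that the objects with known responses come first are all hypothetical choices made for this example.

```python
import numpy as np


def complete_kernel(Kx, Ky_train, lam=1e-2):
    """Estimate a full response kernel from a full explanatory kernel.

    Kx       : (n, n) complete kernel matrix on all n objects.
    Ky_train : (m, m) known block of the response kernel, assumed to
               correspond to the first m objects (illustrative convention).
    lam      : ridge parameter (assumption; stands in for a penalty term).
    """
    m = Ky_train.shape[0]
    Kx_tt = Kx[:m, :m]   # explanatory kernel among objects with known response
    Kx_at = Kx[:, :m]    # explanatory kernel between all objects and those m

    # Ridge-regularized weights A = R^{-1} Ky_train R^{-1}, R = Kx_tt + lam*I.
    R = Kx_tt + lam * np.eye(m)
    left = np.linalg.solve(R, Ky_train)   # R^{-1} Ky_train
    A = np.linalg.solve(R, left.T).T      # (R^{-1} Ky_train) R^{-1}, R symmetric

    # Predict every entry, then symmetrize to remove rounding asymmetry.
    Ky_hat = Kx_at @ A @ Kx_at.T
    Ky_hat = 0.5 * (Ky_hat + Ky_hat.T)

    # Keep the entries that were actually observed.
    Ky_hat[:m, :m] = Ky_train
    return Ky_hat
```

A sanity check for a sketch like this is to pass the same low-rank linear kernel as both explanatory and response kernel: with a small `lam`, the predicted missing entries should then closely match the true ones. In practice `lam` would need to be tuned, for example by cross-validation on the known block.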