Interpretable QSPR Modeling using Recursive Feature Machines and Multi-scale Fingerprints

Jiaxuan Shen, Haitao Zhang, Yunjie Wang, Yilong Wang, Song Tao, Bo Qiu, Ng Shyh-Chang
2024-11-21
Abstract: This study pioneers the application of Recursive Feature Machines (RFM) in QSPR modeling, introducing a tailored feature importance analysis approach to enhance interpretability. By leveraging deep feature learning through the average gradient outer product (AGOP), RFM achieves state-of-the-art (SOTA) results in predicting molecular properties, as demonstrated through solubility prediction across nine benchmark datasets. To capture a wide array of structural information, we employ diverse molecular representations, including MACCS keys, Morgan fingerprints, and a custom multi-scale hybrid fingerprint (HF) derived from global descriptors and SMILES local fragmentation techniques. Notably, the HF offers significant advantages over MACCS and Morgan fingerprints in revealing structural determinants of molecular properties. The feature importance analysis in RFM provides robust local and global explanations, effectively identifying structural features that drive molecular behavior and offering valuable insights for drug development. Additionally, RFM demonstrates strong redundancy-filtering abilities, as model performance remains stable even after removing redundant features within custom fingerprints. Importantly, RFM brings the deep feature learning capabilities of the AGOP matrix into ultra-fast kernel machine learning, giving kernel machines interpretable deep feature learning. We extend this approach beyond the Laplace kernel to the Matern, Rational Quadratic, and Gaussian kernels and find that the Matern and Laplace kernels deliver the best performance, reinforcing the flexibility and effectiveness of AGOP in RFM. Experimental results show that RFM-HF surpasses both traditional machine learning models and advanced graph neural networks.
Biomolecules
What problem does this paper attempt to address?
This paper aims to solve the problem of insufficient interpretability of Quantitative Structure-Property Relationship (QSPR) models in drug development and molecular design. Although existing QSPR models have made remarkable progress in predicting molecular properties, many of them lack interpretability, which hinders the development of new drugs and new molecules.

Specifically, by introducing Recursive Feature Machines (RFM) and multi-scale fingerprints, this paper proposes an interpretable QSPR modeling method. RFM combines the advantages of deep feature learning and kernel machine learning, and can provide local and global feature importance analysis, thereby improving the interpretability and prediction performance of the model.

#### Main research objectives:

1. **Improve the interpretability of QSPR models**: By introducing RFM and customized multi-scale hybrid fingerprints (HF), the model can not only accurately predict molecular properties, but also explain which structural features have an important impact on molecular behavior.
2. **Improve molecular representation methods**: Use MACCS keys, Morgan fingerprints and customized multi-scale hybrid fingerprints (HF) to capture a wider range of structural information, especially those structural features that have a decisive impact on molecular properties.
3. **Verify the effectiveness of RFM**: Through solubility prediction experiments on nine benchmark datasets, prove the superiority of RFM in prediction accuracy and interpretability, surpassing traditional machine learning models and advanced graph neural networks (GNNs).
4. **Explore the performance of different kernel functions**: Apply AGOP to different kernel functions such as Laplace, Matern, Rational Quadratic and Gaussian, and evaluate their performance in terms of overfitting and generalization ability.
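To make the RFM idea more concrete, here is a minimal sketch of the general recipe: fit a kernel ridge regressor with a Laplace kernel whose distance is weighted by a feature matrix M, recompute M as the AGOP (the average over training points of the outer product of the predictor's gradient with itself), and repeat. This is an illustration under stated assumptions, not the authors' implementation. The function names (`m_dist`, `solve`, `fit`, `grad`, `rfm`), the bandwidth `L`, the ridge term `reg`, the iteration count, and the random toy descriptors standing in for fingerprints are all hypothetical. Reading the diagonal of M as a global importance score is one common interpretation and may differ from the paper's tailored analysis.

```python
import math
import random

def m_dist(x, z, M):
    """M-weighted distance sqrt((x - z)^T M (x - z))."""
    d = [a - b for a, b in zip(x, z)]
    q = sum(d[i] * M[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(max(q, 0.0))  # clamp tiny negative round-off

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit(X, y, M, L, reg):
    """Kernel ridge regression with the Laplace kernel exp(-||x - z||_M / L)."""
    n = len(X)
    K = [[math.exp(-m_dist(X[i], X[j], M) / L) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += reg  # ridge regularization (assumed value, not from the paper)
    return solve(K, y)

def grad(x, X, alpha, M, L):
    """Gradient of f(x) = sum_i alpha_i exp(-||x - x_i||_M / L) with respect to x."""
    d = len(x)
    g = [0.0] * d
    for a, xi in zip(alpha, X):
        r = m_dist(x, xi, M)
        if r < 1e-10:
            continue  # the kernel is not differentiable at r = 0; skip that term
        diff = [x[k] - xi[k] for k in range(d)]
        Md = [sum(M[j][k] * diff[k] for k in range(d)) for j in range(d)]
        c = -a * math.exp(-r / L) / (L * r)
        for j in range(d):
            g[j] += c * Md[j]
    return g

def rfm(X, y, iters=3, L=1.0, reg=1e-3):
    """Alternate between kernel regression and the AGOP update of M."""
    d = len(X[0])
    M = [[float(i == j) for j in range(d)] for i in range(d)]  # start from identity
    for _ in range(iters):
        alpha = fit(X, y, M, L, reg)
        grads = [grad(x, X, alpha, M, L) for x in X]
        # AGOP: average of gradient outer products over the training points
        M = [[sum(g[i] * g[j] for g in grads) / len(X) for j in range(d)]
             for i in range(d)]
    return M, fit(X, y, M, L, reg)

# Toy data: three made-up descriptors, where only the first one drives the target.
random.seed(0)
X = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(40)]
y = [2.0 * x[0] for x in X]
M, alpha = rfm(X, y)
importance = [M[i][i] for i in range(3)]  # diagonal of M as a global feature score
```

In this toy setup the first entry of `importance` should dominate, since the target depends on the first descriptor alone. Swapping in another radial kernel (Gaussian, Matern, Rational Quadratic) would change only the kernel expression in `fit` and the matching derivative in `grad`.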
In conclusion, by introducing RFM and multi-scale fingerprints, this paper addresses the poor interpretability of existing QSPR models and provides valuable guidance for drug development and molecular design.