Abstract: Ensemble learning improves machine learning results by combining several models, yielding better predictive performance than a single model. It also benefits and accelerates research in quantitative structure–activity relationship (QSAR) and quantitative structure–property relationship (QSPR) modeling. As the number of ensemble learning models such as random forest grows, the effectiveness of QSAR/QSPR will be limited if the models cannot explain their predictions to researchers. In fact, many implementations of ensemble learning models can quantify the overall magnitude of each feature's contribution. For example, feature importance allows us to assess the relative importance of features and to interpret the predictions. However, different ensemble learning methods or implementations may select different features for interpretation. In this paper, we compared the predictability and interpretability of four well-established ensemble learning models (random forest, extremely randomized trees, adaptive boosting and gradient boosting) for regression and binary classification modeling tasks. We then built blending methods that combine the four ensemble learning methods. The blending method led to better performance and a unified interpretation by summarizing the individual predictions of the different learning models. The important features identified in two case studies, which provided valuable information about compound properties, are discussed in detail in this report. QSPR modeling with interpretable machine learning techniques can help chemical design work more efficiently, confirm hypotheses and establish knowledge for better results.
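The abstract does not specify how the blending is computed, so the following is only a minimal sketch under one simple assumption: predictions and normalized feature importances from the four models are averaged with equal weights. The helper names (`normalize`, `blend_predictions`, `blend_importances`), the descriptor names, and all numbers are hypothetical and chosen for illustration. In practice the per-model values might come from fitted estimators, for example the `feature_importances_` attribute of scikit-learn tree ensembles.

```python
from statistics import mean


def normalize(importances):
    """Scale non-negative importances so they sum to 1."""
    total = sum(importances)
    if total == 0:
        return [0.0 for _ in importances]
    return [value / total for value in importances]


def blend_predictions(per_model_predictions):
    """Average the predictions of several models, sample by sample."""
    return [mean(sample) for sample in zip(*per_model_predictions)]


def blend_importances(per_model_importances):
    """Average normalized feature importances across models."""
    normalized = [normalize(imp) for imp in per_model_importances]
    return [mean(feature) for feature in zip(*normalized)]


# Hypothetical descriptor names and made-up values, for illustration only.
feature_names = ["logP", "mol_weight", "tpsa"]

predictions = {
    "random_forest": [1.0, 2.0, 3.0],
    "extra_trees": [1.2, 1.8, 3.1],
    "adaboost": [0.8, 2.2, 2.9],
    "gradient_boosting": [1.0, 2.0, 3.0],
}

importances = {
    "random_forest": [0.5, 0.3, 0.2],
    "extra_trees": [0.4, 0.4, 0.2],
    "adaboost": [6.0, 3.0, 1.0],  # deliberately unnormalized
    "gradient_boosting": [0.7, 0.2, 0.1],
}

blended_pred = blend_predictions(list(predictions.values()))
blended_imp = blend_importances(list(importances.values()))

# Rank features by blended importance, most important first.
ranking = sorted(
    range(len(feature_names)), key=lambda i: blended_imp[i], reverse=True
)
for i in ranking:
    print(f"{feature_names[i]}: {blended_imp[i]:.3f}")
```

Normalizing before averaging keeps a model that reports importances on a different scale from dominating the blended ranking. Equal weighting is an assumption here; weights based on validation performance would be another reasonable choice.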

Boosting: An Ensemble Learning Tool for Compound Classification and QSAR Modeling

Practical guidelines for the use of gradient boosting for molecular property prediction

A Case-Based Meta-Learning Algorithm Boosts the Performance of Structure-Based Virtual Screening.

Boosting the Partial Least Square Algorithm for Regression Modelling

Light Gradient Boosting Machine as a Regression Method for Quantitative Structure-Activity Relationships

QSPR Study for Prediction of Boiling Points of 2475 Organic Compounds Using Stochastic Gradient Boosting

Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling

Boost-S: Gradient Boosted Trees for Spatial Data and Its Application to FDG-PET Imaging Data

SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines

Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS) and its application on modeling ligand functionality for 5HT-subtype GPCR families

Comprehensive ensemble in QSAR prediction for drug discovery

Using LogitBoost classifier to predict protein structural classes.

QBoost: Predicting quantiles with boosting for regression and binary classification

Using Support Vector Regression Coupled with the Genetic Algorithm for Predicting Acute Toxicity to the Fathead Minnow

Comparison and improvement of the predictability and interpretability with ensemble learning models in QSPR applications

Development of the AdaBoost-SVM model for the classification of estrogen receptor-β ligands

Prediction of Blood-Brain Barrier Permeability of Compounds by Fusing Resampling Strategies and eXtreme Gradient Boosting

TwinBooster: Synergising Large Language Models with Barlow Twins and Gradient Boosting for Enhanced Molecular Property Prediction

Structure-Based Molecule Optimization via Gradient-Guided Bayesian Update

ChemBoost: A chemical language based approach for protein-ligand binding affinity prediction

High Performance of Gradient Boosting in Binding Affinity Prediction