Abstract: Variable selection and outlier detection are important steps in chemical modeling. The two usually affect each other, and the order in which they are performed also strongly affects the modeling results. Currently, many studies perform these steps separately and in different orders. In this study, we examined the interaction between outliers and variables and compared modeling procedures that perform variable selection and outlier detection in different orders. Because the order can affect the interpretation of the model, it is difficult to decide which order is preferable when the prediction errors of the different orders are close. To address this problem, a simultaneous variable selection and outlier detection approach called Model Adaptive Space Shrinkage (MASS) was developed. The proposed approach is based on model population analysis (MPA). Through weighted binary matrix sampling (WBMS) from the model space, a large number of partial least squares (PLS) regression models were built, and the best-performing (elite) fraction of the models was selected to statistically reassign the weight of each variable and sample. The whole process was then repeated until the weights of the variables and samples converged. Finally, MASS adaptively found a high-performance model consisting of an optimized variable subset and an optimized sample subset. The combination of these two subsets can be considered the cleaned dataset used for chemical modeling. Because variables and samples are handled simultaneously, the question of which to process first does not arise. One near-infrared (NIR) spectroscopy dataset and one quantitative structure-activity relationship (QSAR) dataset were used to test the approach. The results demonstrated that MASS is a useful method for data cleaning before building a predictive model.
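
The sampling-and-reweighting loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: the function names (`sample_binary_matrix`, `cv_rmse`, `mass_sketch`), the initial weights of 0.5, the elite fraction, the fixed iteration count used in place of a convergence test, and the synthetic data. To stay dependency-free, an ordinary least-squares fit stands in for the PLS regression used in the paper.

```python
import numpy as np

def sample_binary_matrix(weights, n_models, rng):
    # Weighted binary matrix sampling: in each row (one sub-model),
    # item j is included with probability weights[j].
    return rng.random((n_models, weights.size)) < weights

def cv_rmse(X, y, folds=5):
    # K-fold cross-validated RMSE of a least-squares fit with intercept.
    # Assumption: least squares is a stand-in for PLS regression.
    n = len(y)
    idx = np.arange(n)
    A = np.column_stack([np.ones(n), X])
    sq_err = np.empty(n)
    for k in range(folds):
        test = idx % folds == k
        coef, *_ = np.linalg.lstsq(A[~test], y[~test], rcond=None)
        sq_err[test] = (A[test] @ coef - y[test]) ** 2
    return float(np.sqrt(sq_err.mean()))

def mass_sketch(X, y, n_models=300, elite_frac=0.1, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w_var = np.full(p, 0.5)  # assumed initial variable weights
    w_smp = np.full(n, 0.5)  # assumed initial sample weights
    n_elite = max(1, int(elite_frac * n_models))
    for _ in range(n_iter):  # fixed count instead of a convergence check
        V = sample_binary_matrix(w_var, n_models, rng)
        S = sample_binary_matrix(w_smp, n_models, rng)
        errors = np.full(n_models, np.inf)
        for m in range(n_models):
            rows, cols = S[m], V[m]
            # Skip sub-models with no variables or too few samples to fit.
            if cols.any() and rows.sum() >= 2 * (cols.sum() + 1):
                errors[m] = cv_rmse(X[np.ix_(rows, cols)], y[rows])
        elite = np.argsort(errors)[:n_elite]
        # New weight = how often each variable/sample appears in elite models.
        w_var = V[elite].mean(axis=0)
        w_smp = S[elite].mean(axis=0)
    return w_var, w_smp

# Synthetic demonstration (hypothetical data): two informative variables,
# six noise variables, and three samples with a corrupted response.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=60)
outliers = np.array([5, 20, 41])
y[outliers] += 10.0

w_var, w_smp = mass_sketch(X, y)
print("variable weights:", np.round(w_var, 2))
print("outlier sample weights:", np.round(w_smp[outliers], 2))
```

In this sketch a weight that reaches 0 or 1 stays there, because the corresponding variable or sample is then never or always sampled. That is one simple way to picture the space shrinking toward a single variable subset and sample subset.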

The Model Adaptive Space Shrinkage (MASS) Approach: a New Method for Simultaneous Variable Selection and Outlier Detection Based on Model Population Analysis
