Abstract: In the field of environmental science and engineering, a quantitative structure-activity relationship (QSAR) is the quantitative relationship between the structure of molecules (or their aggregates, e.g., nanoparticles) and certain endpoints. Here, endpoints generally refer to physicochemical properties, biological effects or environmental behavior parameters that can be measured or modeled. Based on data sets of chemical structures and their known endpoint values (i.e., the training set), QSAR models use specific algorithms to establish mathematical relationships between the numerical features that characterize the molecular structure (i.e., descriptors) and the endpoint values. The established mathematical relationships can then be employed to predict the endpoint values for given chemical structures. QSAR models are important tools for filling data gaps in the environmental safety of chemicals and for addressing the issues arising from so-called "emerging pollutants", which are closely related to the improper management of chemicals. Notably, QSAR models are intrinsically data-driven. The relationships present in the training set are not necessarily applicable to arbitrary chemical structures, so the reliability of a QSAR model is always limited to a certain applicability domain. Therefore, acceptance of QSAR models in the sound management of chemicals requires clearly defined applicability domains. This study reviewed three concepts of the applicability domain: the descriptor domain, the structural domain and the mechanism domain. For characterizing the descriptor domain, methods based on the hyper-rectangle, the convex hull, joint probability density estimation and various types of distances were described. Notably, when Boolean fingerprints are used as descriptors, these methods become meaningless. Thus, the implementation, characteristics and advantages of the structural domain, which is based on fingerprints and similarity, were specifically introduced. 
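As a rough illustration of the difference between a descriptor domain and a structural domain, the following minimal Python sketch contrasts a hyper-rectangle check on continuous descriptors with a nearest-neighbor Tanimoto similarity check on Boolean fingerprints. It is not the implementation used in the reviewed studies. The function names, the representation of a fingerprint as a set of on-bit indices, and the similarity threshold of 0.5 are all assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple


def fit_hyper_rectangle(train: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Return the (min, max) range of each descriptor over the training set."""
    columns = list(zip(*train))
    return [(min(col), max(col)) for col in columns]


def in_hyper_rectangle(x: Sequence[float], bounds: List[Tuple[float, float]]) -> bool:
    """A query is inside the descriptor domain if every descriptor lies within its range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(x, bounds))


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Assumed convention: two empty fingerprints are treated as identical.
    return len(a & b) / union if union else 1.0


def in_structural_domain(query: Set[int], train_fps: Sequence[Set[int]],
                         threshold: float = 0.5) -> bool:
    """Inside the structural domain if the most similar training compound
    reaches the (hypothetical) similarity threshold."""
    return max(tanimoto(query, fp) for fp in train_fps) >= threshold


if __name__ == "__main__":
    # Toy training data: two continuous descriptors per compound (made up).
    train_descriptors = [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
    bounds = fit_hyper_rectangle(train_descriptors)
    print(in_hyper_rectangle([2.5, 12.0], bounds))
    print(in_hyper_rectangle([4.0, 12.0], bounds))

    # Toy fingerprints as sets of on-bit indices (made up).
    train_fingerprints = [{2, 3, 4}, {7, 8}]
    print(in_structural_domain({1, 2, 3}, train_fingerprints))
```

The hyper-rectangle test relies on the ordering of descriptor values, which carries little information for 0/1 bits; a similarity-based check is one way to define a domain for fingerprint descriptors instead.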
Moreover, structure-activity landscapes (SALs), as exemplified by a network-like similarity graph (NSG) and a 3D topography of the endpoint and descriptor values, were also demonstrated. "Activity cliffs" manifested in the SALs were highlighted. Based on the NSG, a local discontinuity score (LDS) was calculated to quantitatively describe the "smoothness" of the SAL. The LDS was also shown to be linearly correlated with the cross-validation log loss of random forest classifiers for 59 Tox21 bioassay endpoints, which implies that the smoothness of the training SAL could be an indicator of the performance of the established QSAR models. The causes of "activity cliffs" were further discussed and assumed to depend on the complexity and spatial heterogeneity of the investigated endpoint systems. In particular, the chaotic and emergent behavior of complex systems could be sensitive to tiny but specific structural changes in small molecules and thus cannot be satisfactorily predicted from molecular descriptors alone. Hence, the understanding of the mechanisms underlying the endpoints was emphasized, which corresponds to the concept of the mechanism domain. In conclusion, in order to assess the feasibility of the molecular descriptors, explain the mechanisms of the QSARs, and reasonably select methods for defining the applicability domain, it is essential to address three questions: (1) What system does the endpoint intrinsically describe? (2) How complex and spatially heterogeneous is the system? (3) Does the endpoint involve emergent system behavior? With the strengthening of chemicals management and the emphasis on the treatment of emerging pollutants, the development and use of QSAR models are expected to increase. 
Only with basic knowledge of the mechanisms underlying the modeled endpoints, sufficient recognition of the characteristics of the established QSAR models, and properly selected methods for characterizing the applicability domain can QSAR models yield relatively reliable predictions and thus benefit the environmental management of chemicals in China.
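The notion of an "activity cliff" in a structure-activity landscape can be illustrated with a short Python sketch that flags pairs of compounds that are structurally similar yet differ strongly in endpoint value. This is not the local discontinuity score (LDS) of the reviewed work, whose exact definition is not given in this abstract. The function names, both thresholds and the toy data are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Assumed convention: two empty fingerprints are treated as identical.
    return len(a & b) / union if union else 1.0


def find_activity_cliffs(fps: Sequence[Set[int]], endpoints: Sequence[float],
                         sim_threshold: float = 0.7,
                         diff_threshold: float = 1.0) -> List[Tuple[int, int, float, float]]:
    """Return (i, j, similarity, endpoint difference) for every pair that is
    at least sim_threshold similar but differs by at least diff_threshold.
    Both thresholds are hypothetical and would need tuning per endpoint."""
    cliffs = []
    for i, j in combinations(range(len(fps)), 2):
        similarity = tanimoto(fps[i], fps[j])
        difference = abs(endpoints[i] - endpoints[j])
        if similarity >= sim_threshold and difference >= diff_threshold:
            cliffs.append((i, j, similarity, difference))
    return cliffs


if __name__ == "__main__":
    # Toy data (made up): compounds 0 and 1 share most bits but differ in endpoint.
    fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {10, 11}]
    endpoint_values = [5.0, 8.0, 5.1]
    for i, j, sim, diff in find_activity_cliffs(fingerprints, endpoint_values):
        print(f"compounds {i} and {j}: similarity={sim:.2f}, difference={diff:.2f}")
```

A training set in which many such pairs occur has a rugged landscape, which is the situation in which the abstract expects descriptor-based predictions to be less reliable.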

Characterization of Applicability Domains for QSAR Models
