Characterization of Applicability Domains for QSAR Models

Zhongyu Wang, Jingwen Chen, Zhiqiang Fu, Xuehua Li
DOI: https://doi.org/10.1360/tb-2021-0406
2023-01-01
Abstract: In the field of environmental science and engineering, a quantitative structure-activity relationship (QSAR) is a quantitative relationship between the structure of molecules (or their aggregates, e.g., nanoparticles) and certain endpoints. Here, endpoints generally refer to physicochemical properties, biological effects, environmental behavior parameters, and similar quantities that can be measured or modeled. Based on a data set of chemical structures with known endpoint values (i.e., the training set), a QSAR model uses a specific algorithm to establish a mathematical relationship between the numerical features that characterize the molecular structure (i.e., descriptors) and the endpoint values. The established relationship can then be used to predict endpoint values for given chemical structures. QSAR models are important tools for filling data gaps in the environmental safety of chemicals and for addressing issues arising from so-called "emerging pollutants", which are closely related to the improper management of chemicals. Notably, QSAR models are intrinsically data-driven. The relationships present in the training set are not necessarily applicable to arbitrary chemical structures, so the reliability of a QSAR model is always limited to a certain applicability domain. Therefore, acceptance of QSAR models in the sound management of chemicals requires clearly defined applicability domains. This study reviewed three concepts of the applicability domain: the descriptor domain, the structural domain, and the mechanism domain. For characterizing the descriptor domain, methods based on the hyper-rectangle, the convex hull, joint probability density estimation, and various types of distances were described. Notably, when Boolean fingerprints are used as descriptors, these methods become meaningless. Thus, the implementation, characteristics, and advantages of the structural domain based on fingerprints and similarity were specifically introduced.
Moreover, structure-activity landscapes (SALs), as exemplified by a network-like similarity graph (NSG) and a 3D topography of the endpoint and the descriptor values, were also demonstrated. "Activity cliffs" manifested in the SALs were highlighted. Based on the NSG, a local discontinuity score (LDS) was calculated to quantitatively describe the "smoothness" of the SAL. The LDS was also shown to be linearly correlated with the cross-validation log loss of random forest classifiers for 59 Tox21 bioassay endpoints, further implying that the smoothness of the training SAL could be an indicator of the performance of the established QSAR models. The causes of "activity cliffs" were further discussed and assumed to depend on the complexity and spatial heterogeneity of the investigated endpoint systems. In particular, the chaotic and emergent behavior of complex systems could be sensitive to tiny but specific structural changes in small molecules and thus cannot be satisfactorily predicted purely from molecular descriptors. Hence, understanding the mechanisms underlying the endpoints was emphasized, which corresponds to the concept of the mechanism domain. In conclusion, in order to understand the feasibility of the molecular descriptors, explain the mechanisms of the QSARs, and reasonably select methods for defining the applicability domain, it is essential to address three questions: (1) What system does the endpoint intrinsically describe? (2) How complex and spatially heterogeneous is the system? (3) Does the endpoint take into account the emergent behavior of the system? With the strengthening of chemicals management and the emphasis on the treatment of novel pollutants, the development and use of QSAR models are expected to increase.
Only with basic knowledge of the mechanisms underlying the modeled endpoints, sufficient recognition of the characteristics of the established QSAR models, and properly selected applicability domain characterization methods can QSAR models yield relatively reliable predictions and thus benefit the environmental management of chemicals in China.
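As an illustration of two of the domain concepts named in the abstract, the following is a minimal sketch and not the authors' implementation. It shows a hyper-rectangle check for the descriptor domain (per-descriptor minimum and maximum over the training set) and a similarity-based check for the structural domain. The abstract does not name a similarity metric; Tanimoto similarity on fingerprint on-bit sets is assumed here as a common choice. The function names, the threshold of 0.5, and the minimum neighbor count are all hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def hyper_rectangle_bounds(
    training: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return per-descriptor (minimum, maximum) lists over the training set."""
    columns = list(zip(*training))
    return [min(c) for c in columns], [max(c) for c in columns]


def in_hyper_rectangle(
    query: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> bool:
    """True if every descriptor of the query lies within the training range."""
    return all(lo <= x <= hi for x, lo, hi in zip(query, lower, upper))


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(a | b)


def in_structural_domain(
    query_fp: Set[int],
    training_fps: Sequence[Set[int]],
    threshold: float = 0.5,   # hypothetical similarity cutoff
    min_neighbors: int = 1,   # hypothetical neighbor requirement
) -> bool:
    """True if enough training structures are at least `threshold` similar."""
    neighbors = sum(1 for fp in training_fps if tanimoto(query_fp, fp) >= threshold)
    return neighbors >= min_neighbors


# Toy data, invented purely for illustration.
train_descriptors = [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
lower, upper = hyper_rectangle_bounds(train_descriptors)
inside = in_hyper_rectangle([2.5, 12.0], lower, upper)
outside = in_hyper_rectangle([4.0, 12.0], lower, upper)

train_fps = [{1, 2, 3}, {7, 8, 9}]
similar = in_structural_domain({2, 3, 4}, train_fps)
dissimilar = in_structural_domain({20, 21}, train_fps)
```

The range check assumes continuous descriptors. With Boolean fingerprints every bit has only the values 0 and 1, so a per-bit range rarely excludes anything, which is consistent with the abstract's remark that such descriptor-domain methods lose their meaning there and a similarity-based structural domain is used instead.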