R-NN Curves: an Intuitive Approach to Outlier Detection Using a Distance Based Method

Rajarshi Guha, Debojyoti Dutta, Peter C. Jurs, Ting Chen
DOI: https://doi.org/10.1021/ci060013h
IF: 6.162
2006-01-01
Journal of Chemical Information and Modeling
Abstract: Libraries of chemical structures are used in a variety of cheminformatics tasks, such as virtual screening and QSAR modeling, and are generally characterized using molecular descriptors. When working with libraries, it is useful to understand the distribution of compounds in the space defined by a set of descriptors. We present a simple approach to the analysis of the spatial distribution of the compounds in a library in general, and to outlier detection in particular, based on counts of neighbors within a series of increasing radii. The resultant curves, termed R-NN curves, appear to follow a logistic model for any given descriptor space, which we justify theoretically for the 2D case. The method can be applied to data sets of arbitrary dimensions. The R-NN curves provide a visual method to easily detect compounds lying in a sparse region of a given descriptor space. We also present a method to numerically characterize the R-NN curves, thus allowing identification of outliers in a single plot.
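The neighbor-counting idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `rnn_curves` and `curve_area`, the Euclidean metric, the choice of 100 radii scaled to the largest pairwise distance, and the use of the area under each curve as a numerical summary are all assumptions made here for illustration; the abstract does not specify the paper's own characterization.

```python
import numpy as np


def rnn_curves(X, n_radii=100):
    """Fraction of neighbors within each radius, for every point.

    Hypothetical helper: radii run from d_max / n_radii up to d_max,
    where d_max is the largest pairwise Euclidean distance (an assumption).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Full pairwise Euclidean distance matrix, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d_max = dist.max()
    radii = np.linspace(d_max / n_radii, d_max, n_radii)
    # counts[i, k] = number of points within radii[k] of point i,
    # minus 1 so that a point is not counted as its own neighbor.
    counts = (dist[:, :, None] <= radii[None, None, :]).sum(axis=1) - 1
    return radii, counts / (n - 1)


def curve_area(curves):
    """Mean height of each curve, a stand-in summary (assumption).

    A point in a sparse region gains neighbors only at large radii,
    so its curve stays low for longer and its area is small.
    """
    return curves.mean(axis=1)
```

In this sketch, a point in a dense region reaches a high neighbor fraction at small radii, while an isolated point stays near zero until the radius is large. Sorting points by `curve_area` therefore puts candidates from sparse regions first. The full distance tensor uses memory proportional to n² times the number of radii, so the sketch suits small libraries only.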