Construction of balanced, chemically dissimilar training, validation and test sets for machine learning on molecular datasets

Giovanni A. Tricarico, Johan Hofmans, Eelke B. Lenselink, Miriam López-Ramos, Marie-Pierre Dréanic, Pieter F. W. Stouten
DOI: https://doi.org/10.26434/chemrxiv-2022-m8l33-v3
2024-03-29
Abstract: When preparing training, validation and test sets for machine learning on molecular datasets, it is desirable to combine two requirements: 1) robustness, i.e. making a test set that is chemically dissimilar from the training set; 2) data balance, i.e. ensuring that the proportion of data points and the distribution of data labels (categorical) / data values (continuous) are as homogeneous as possible among the sets, for each individual property to model, while partitioning the overall set of compounds as required. Recent literature shows that meeting both these requirements simultaneously is sometimes very difficult. This is especially true for multi-task learning, but also for single-task learning if one aims to balance the distribution of data labels or values, too. In this work we present a method that resolves this issue by first carrying out a chemistry-guided clustering of the initial dataset to ensure the separation of chemical matter, and subsequently applying linear programming to select the lists of clusters that – once assembled into the final sets – result in the best possible data balance.
Chemistry
What problem does this paper attempt to address?
This paper addresses how to achieve both chemical dissimilarity and data balance among the training, validation and test sets when applying machine learning to molecular datasets. Specifically:

1. **Chemical dissimilarity**: For a robust evaluation, the test set should be chemically dissimilar from the training set. This gives a better measure of how the model performs on compounds with different chemical structures, which matters in drug discovery, where SAR (structure-activity relationships) must often be extrapolated to other series of interest or to new chemical matter.
2. **Data balance**: The proportion of data points and the distribution of data labels (categorical data) or data values (continuous data) should be as uniform as possible across the subsets. The overall set of compounds must be split in the required proportions, and those proportions should also hold for the data of each individual property being modelled.

Meeting both requirements at once is difficult, particularly in multi-task learning. The paper proposes a method that first separates chemical matter through chemistry-guided clustering, then uses linear programming to select the clusters assigned to each set so as to achieve the best possible data balance.
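The cluster-selection step can be illustrated with a small Python sketch. It is not the authors' implementation: the clustering is assumed to have been done already, the per-cluster counts and the names `cluster_counts`, `targets`, `imbalance` and `best_split` are hypothetical, and an exhaustive search stands in for the linear programming solver used in the paper, which is only practical here because the toy problem has six clusters. The sketch shows the core idea: whole clusters are assigned to sets, so chemically similar compounds stay together, and the assignment is chosen to keep each task's data as close as possible to the target proportions.

```python
from itertools import product

# Hypothetical toy input: number of data points each cluster contributes to
# each task. Clusters are assumed to come from a prior chemistry-guided
# clustering step, and every task is assumed to have at least one data point.
cluster_counts = [
    {"task_a": 40, "task_b": 10},
    {"task_a": 20, "task_b": 50},
    {"task_a": 10, "task_b": 15},
    {"task_a": 10, "task_b": 5},
    {"task_a": 15, "task_b": 5},
    {"task_a": 5, "task_b": 15},
]

# Desired fraction of each task's data in each set (assumed values).
targets = {"train": 0.6, "validation": 0.2, "test": 0.2}


def imbalance(assignment, cluster_counts, targets):
    """Sum over tasks and sets of |actual fraction - target fraction|."""
    total = 0.0
    for task in cluster_counts[0]:
        task_total = sum(counts[task] for counts in cluster_counts)
        for set_name, target in targets.items():
            in_set = sum(
                counts[task]
                for counts, assigned in zip(cluster_counts, assignment)
                if assigned == set_name
            )
            total += abs(in_set / task_total - target)
    return total


def best_split(cluster_counts, targets):
    """Try every cluster-to-set assignment and keep the most balanced one.

    Exhaustive search is a stand-in for an integer linear programming solver;
    it scales as len(targets) ** len(cluster_counts), so it only suits toy inputs.
    """
    candidates = product(targets, repeat=len(cluster_counts))
    return min(candidates, key=lambda a: imbalance(a, cluster_counts, targets))


split = best_split(cluster_counts, targets)
for index, set_name in enumerate(split):
    print(f"cluster {index} -> {set_name}")
print("imbalance:", imbalance(split, cluster_counts, targets))
```

Because each cluster is assigned as a unit, no cluster is ever divided between training and test data. For realistic numbers of clusters the same objective would be handed to a linear programming solver rather than enumerated.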