Proximal Validation Protocol

MingFeng Ou, Yiming Zhang, Sai Wu, Gang Chen, Junbo Zhao
2023-01-01
Abstract: Modern machine learning algorithms are generally built upon a train/validation/test split protocol. In real-world ML development, where no test set is accessible, how the validation set is split out becomes crucial for reliable model evaluation, model selection, and related tasks. Concretely, under a randomized splitting setup, the split ratio of the validation set acts as a vital meta-parameter: the more data is held out for validation, the less remains for training, which costs model performance, and vice versa. This implies a vexing trade-off between performance enhancement and trustworthy model evaluation. To date, however, very little research has been conducted along this line. We reason that this could be due to a workflow gap between academia and ML production, which we may attribute to a form of technical debt in ML. In this article, we propose a novel scheme, dubbed Proximal Validation Protocol (PVP), which targets this problem of validation set construction. The core of PVP is to assemble a proximal set as a substitute for the traditional validation set, while avoiding the waste of valuable data that could otherwise be used for training. The proximal validation set is constructed by dense data augmentation followed by a novel distributional-consistent sampling algorithm. With extensive empirical findings, we show that PVP works (much) better than all the other existing validation protocols on three data modalities (images, text, and tabular data), demonstrating its feasibility for ML production.
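To make the split-ratio trade-off concrete, the following is a minimal sketch of the conventional randomized split that the abstract contrasts PVP against. It is not PVP itself, and the function name `random_split` and the parameter `val_ratio` are illustrative assumptions rather than names from the paper.

```python
import random


def random_split(n_samples, val_ratio, seed=0):
    """Randomly partition indices 0..n_samples-1 into (train, val).

    `val_ratio` is the meta-parameter discussed in the abstract: the
    fraction of the available data held out for validation.
    """
    if not 0.0 <= val_ratio < 1.0:
        raise ValueError("val_ratio must be in [0, 1)")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = int(round(n_samples * val_ratio))
    # The first n_val shuffled indices are held out; the rest train the model.
    return indices[n_val:], indices[:n_val]


# A larger validation ratio leaves fewer samples for training.
for ratio in (0.1, 0.2, 0.4):
    train_idx, val_idx = random_split(1000, ratio)
    print(f"val_ratio={ratio}: train={len(train_idx)}, val={len(val_idx)}")
```

Every index assigned to the validation side is unavailable for training, which is the cost the abstract describes; PVP aims to avoid it by validating on a constructed proximal set instead.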