72 Developing Machine Learning Models When Data is Limiting

Luis O Tedeschi
DOI: https://doi.org/10.1093/jas/skac247.061
2022-09-21
Journal of Animal Science
Abstract: Animal scientists have become more enthusiastic about developing machine learning (ML) models to improve the predictability of variables of interest. However, adequate data is limited either because ways to collect the data are scarce or because the process is expensive, time- or labor-consuming, or it simply takes too long to be collected. In addition to the usual hurdles in developing ML models, e.g., appropriate technique, the number of layers, and their activation, enough high-quality data is an essential requirement for ML that is often neglected. Should the data come from longitudinal or cross-sectional experiments, and from how many subjects? When enough high-quality, reliable data is limiting, one alternative is to create synthetic datasets that reflect the correlation among input and output variables of interest. The probability distribution for each variable needs to be well defined, and the correlation among variables must be taken into account. The normal distribution may not always be a reasonable assumption for all variables; thus, variable-specific distributions must be used with their appropriate parameters. An adequate range (min and max) must be provided for each variable to represent it. Shortcomings might arise when relationships between variables are nonlinear. The synthetic dataset will also fail to provide good predictability when the new inputs do not have correlations similar to those of the variables used to build the synthetic dataset. It is common to standardize or normalize each variable during ML development. Each variable is normalized based on its minimum and maximum values. A common mistake during the ML development process is that training and evaluation subsets are normalized independently when both should be normalized using the range of the complete dataset; otherwise, normalization becomes dependent on the subset, and the ML weights will differ from one epoch to another. This might weaken the ML predictability because the correct range for back-transforming the output to its original value is unknown.
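The two steps described in the abstract, building a correlated synthetic dataset and normalizing every subset with one shared range, can be sketched as follows. This is a minimal illustration and not the author's implementation. The abstract does not name a generation technique, so a Gaussian copula is assumed here as one possible choice. The variable names (`intake`, `gain`), their distributions, parameters, ranges, and the 400/100 split are all hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal used for the copula transform


def truncated_inverse(dist, u, lo, hi):
    """Map u in (0, 1) to a draw from `dist` truncated to [lo, hi]."""
    p_lo, p_hi = dist.cdf(lo), dist.cdf(hi)
    p = p_lo + u * (p_hi - p_lo)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return min(max(dist.inv_cdf(p), lo), hi)  # guard against rounding error


def synthetic_rows(n, rho, seed=0):
    """Draw n (intake, gain) pairs linked by a Gaussian copula.

    `rho` is the correlation on the latent normal scale; the correlation
    observed after the marginal transforms will differ somewhat.
    """
    rng = random.Random(seed)
    intake_dist = NormalDist(mu=10.0, sigma=2.0)   # hypothetical normal marginal
    log_gain_dist = NormalDist(mu=0.0, sigma=0.4)  # hypothetical lognormal marginal
    intake_lo, intake_hi = 5.0, 15.0               # hypothetical min and max
    gain_lo, gain_hi = 0.4, 2.5                    # hypothetical min and max
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # 2x2 Cholesky step: z2 has correlation rho with z1
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        intake = truncated_inverse(intake_dist, u1, intake_lo, intake_hi)
        log_gain = truncated_inverse(
            log_gain_dist, u2, math.log(gain_lo), math.log(gain_hi)
        )
        gain = min(max(math.exp(log_gain), gain_lo), gain_hi)
        rows.append((intake, gain))
    return rows


def pearson(xs, ys):
    """Sample Pearson correlation, to check the generated dataset."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def normalize(values, lo, hi):
    """Min-max scale values to [0, 1] using a given range."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Back-transform scaled values to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


rows = synthetic_rows(n=500, rho=0.7, seed=1)
intake = [i for i, _ in rows]
gain = [g for _, g in rows]
observed_r = pearson(intake, gain)

# Hypothetical split into training and evaluation subsets.
train, evaluation = gain[:400], gain[400:]

# One range taken from the complete dataset and reused for both subsets,
# as the abstract recommends, so the back-transform range is known.
lo, hi = min(gain), max(gain)
train_scaled = normalize(train, lo, hi)
eval_scaled = normalize(evaluation, lo, hi)
restored = denormalize(eval_scaled, lo, hi)
```

Because `lo` and `hi` come from the complete dataset, the same pair back-transforms outputs from either subset. Normalizing `train` and `evaluation` with their own minima and maxima would map identical raw values to different scaled values, which is the subset dependence the abstract warns about.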
agriculture, dairy & animal science