How to select an objective function using information theory

Timothy O. Hodson, Thomas M. Over, Tyler J. Smith, Lucy M. Marshall
DOI: https://doi.org/10.1029/2023WR035803
2024-06-04
Abstract: In machine learning or scientific computing, model performance is measured with an objective function. But why choose one objective over another? Information theory gives one answer: To maximize the information in the model, select the objective function that represents the error in the fewest bits. To evaluate different objectives, transform them into likelihood functions. As likelihoods, their relative magnitude represents how strongly we should prefer one objective versus another, and the log of that relation represents the difference in their bit-length, as well as the difference in their uncertainty. In other words, prefer whichever objective minimizes the uncertainty. Under the information-theoretic paradigm, the ultimate objective is to maximize information (and minimize uncertainty), as opposed to any specific utility. We argue that this paradigm is well-suited to models that have many uses and no definite utility, like the large Earth system models used to understand the effects of climate change.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to select the best objective function (or performance metric) when building a model. According to information theory, the best objective function is the one that minimizes information loss. The paper compares candidate objective functions using the Akaike Information Criterion (AIC) to determine which one performs best in a specific application. It emphasizes the role of information loss and proposes an information-theoretic method for selecting objective functions, with the aim of minimizing uncertainty while maximizing information and general utility. The method helps improve model accuracy and also benefits calibration efficiency, generalization, and data compression.
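The comparison described above can be illustrated with a minimal sketch. It is not the authors' code. It assumes the standard correspondences between mean squared error and a zero-mean normal likelihood, and between mean absolute error and a zero-mean Laplace likelihood. The function names (`normal_loglik`, `laplace_loglik`, `aic`, `bits`) and the synthetic errors are hypothetical, chosen only for illustration. Each objective is turned into a maximized log-likelihood, and the objective that encodes the errors in fewer bits (equivalently, with the lower AIC when both have the same number of parameters) is preferred.

```python
import numpy as np


def normal_loglik(errors):
    """Maximized log-likelihood (nats) for zero-mean normal errors (the MSE objective)."""
    n = errors.size
    sigma2 = np.mean(errors ** 2)  # maximum-likelihood estimate of the variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)


def laplace_loglik(errors):
    """Maximized log-likelihood (nats) for zero-mean Laplace errors (the MAE objective)."""
    n = errors.size
    b = np.mean(np.abs(errors))  # maximum-likelihood estimate of the scale
    return -n * (np.log(2 * b) + 1)


def aic(loglik, k=1):
    """Akaike Information Criterion with k estimated parameters."""
    return 2 * k - 2 * loglik


def bits(loglik):
    """Code length of the errors in bits (negative log-likelihood in base 2)."""
    return -loglik / np.log(2)


# Synthetic, heavy-tailed model errors (an assumption made for this demo only).
rng = np.random.default_rng(0)
errors = rng.laplace(loc=0.0, scale=1.0, size=5000)

ll_mse = normal_loglik(errors)
ll_mae = laplace_loglik(errors)

# Difference in code length: positive means the MAE-type objective is shorter.
bit_difference = bits(ll_mse) - bits(ll_mae)
preferred = "MAE (Laplace)" if bit_difference > 0 else "MSE (normal)"

print(f"AIC, MSE objective: {aic(ll_mse):.1f}")
print(f"AIC, MAE objective: {aic(ll_mae):.1f}")
print(f"Bits saved by MAE over MSE: {bit_difference:.1f}")
print(f"Preferred objective: {preferred}")
```

Because the likelihoods here are continuous densities, the individual bit counts depend on the units of the errors; only the difference between objectives evaluated on the same errors is meaningful in this sketch.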