Why Machine Learning Models Systematically Underestimate Extreme Values

Yuan-Sen Ting
2024-12-08
Abstract: A persistent challenge in astronomical machine learning is a systematic bias where predictions compress the dynamic range of true values -- high values are consistently predicted too low while low values are predicted too high. Understanding this bias has important consequences for astronomical measurements and our understanding of physical processes in astronomical inference. Through analytical examination of linear regression, we show that this bias arises naturally from measurement uncertainties in input features and persists regardless of training sample size, label accuracy, or parameter distribution. In the univariate case, we demonstrate that attenuation becomes important when the ratio of intrinsic signal range to measurement uncertainty ($\sigma_{\text{range}}/\sigma_x$) is below O(10) -- a regime common in astronomy. We further extend the theoretical framework to multivariate linear regression and demonstrate its implications using stellar spectroscopy as a case study. Even under optimal conditions -- high-resolution APOGEE-like spectra (R=24,000) with high signal-to-noise ratios (SNR=100) and multiple correlated features -- we find percent-level bias. The effect becomes even more severe for modern-day low-resolution surveys like LAMOST and DESI due to the lower SNR and resolution. These findings have broad implications, providing a theoretical framework for understanding and addressing this limitation in astronomical data analysis with machine learning.
Instrumentation and Methods for Astrophysics, Cosmology and Nongalactic Astrophysics, Astrophysics of Galaxies, Solar and Stellar Astrophysics
What problem does this paper attempt to address?
This paper addresses the systematic underestimation of extreme values by machine-learning models in astronomy. Specifically, the authors are concerned that predictions compress the dynamic range of the true values: high values are systematically underestimated, while low values are overestimated. This bias affects astronomical measurements and the understanding of the physical processes inferred from them.

### Core of the problem

1. **Systematic bias**: Machine-learning models show a systematic bias in their predictions, with high values underestimated and low values overestimated.
2. **Cause exploration**: This bias is not caused by insufficient training data or sample imbalance, but by measurement uncertainty in the input features.
3. **Theoretical framework**: By analyzing the linear regression model, the authors show how this bias arises naturally from measurement uncertainty and persists regardless of training sample size, label accuracy, or parameter distribution.
4. **Multivariable extension**: The analysis is extended to multivariate linear regression and, using stellar spectroscopy as a case study, shown to matter in practical applications.

### Specific manifestations

- In the univariate case, the attenuation effect becomes significant when the ratio of the intrinsic signal range to the measurement uncertainty ($\frac{\sigma_{\text{range}}}{\sigma_x}$) is below $O(10)$, a regime that is common in astronomy.
- Even under optimal conditions (high-resolution APOGEE-like spectra with $R = 24,000$, a signal-to-noise ratio of $SNR = 100$, and multiple correlated features), a percent-level bias remains.
- For modern low-resolution surveys (such as LAMOST and DESI), the effect is more severe because of the lower signal-to-noise ratio and resolution.
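The univariate threshold in the list above can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes Gaussian true values and Gaussian noise, an arbitrary label scatter `sigma_y = 0.1`, and the standard attenuation factor $1/(1+(\sigma_x/\sigma_{\text{range}})^2)$ for ordinary least squares. The function names `attenuation_factor` and `simulated_slope` are hypothetical and chosen only for illustration.

```python
import numpy as np


def attenuation_factor(ratio):
    """Analytic slope shrinkage for ratio = sigma_range / sigma_x."""
    return 1.0 / (1.0 + 1.0 / ratio**2)


def simulated_slope(ratio, beta=1.0, n=200_000, sigma_y=0.1, seed=0):
    """Fit OLS on noisy inputs and return the recovered slope.

    Assumptions (illustrative only): Gaussian x_true with unit spread,
    Gaussian noise on both x and y.
    """
    rng = np.random.default_rng(seed)
    sigma_range = 1.0
    sigma_x = sigma_range / ratio
    x_true = rng.normal(0.0, sigma_range, n)
    x_obs = x_true + rng.normal(0.0, sigma_x, n)         # noisy input feature
    y_obs = beta * x_true + rng.normal(0.0, sigma_y, n)  # noisy label
    # OLS slope = Cov(x_obs, y_obs) / Var(x_obs)
    cov = np.cov(x_obs, y_obs)
    return cov[0, 1] / cov[0, 0]


if __name__ == "__main__":
    for ratio in (1, 3, 10, 30):
        print(
            f"sigma_range/sigma_x = {ratio:>2}: "
            f"analytic = {attenuation_factor(ratio):.4f}, "
            f"simulated = {simulated_slope(ratio):.4f}"
        )
```

Under these assumptions the analytic factor is 0.5 at a ratio of 1, 0.9 at a ratio of 3, and about 0.99 at a ratio of 10, so the bias falls to the percent level only once the ratio reaches $O(10)$. Increasing `n` tightens the simulated slope around the attenuated value rather than around the true `beta`, which illustrates why more training data does not remove the bias.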
### Solution

The paper proposes a theoretical framework for understanding and addressing this problem, emphasizing the systematic attenuation that measurement uncertainty imposes on regression coefficients. Through this framework, the authors aim to provide a theoretical basis and practical insights for machine-learning applications in astronomical data analysis.

### Mathematical expression

In univariate linear regression, assume that the observed values $y_{\text{obs}}$ and $x_{\text{obs}}$ are:

\[ y_{\text{obs}}=\beta x_{\text{true}}+\delta_y \]

\[ x_{\text{obs}}=x_{\text{true}}+\delta_x \]

where:

- $\beta$ is the slope,
- $\delta_y$ represents measurement uncertainty and intrinsic scatter, with $E[\delta_y] = 0$ and $\text{Var}(\delta_y)=\sigma^2_y$,
- $\delta_x$ represents measurement error, with $E[\delta_x] = 0$ and $\text{Var}(\delta_x)=\sigma^2_x$.

Writing $\sigma^2_{\text{range}}=\text{Var}(x_{\text{true}})$ for the intrinsic spread of the true values, the expected value of the regression coefficient $\hat{\beta}$ estimated by least squares is:

\[ E[\hat{\beta}]=\frac{\text{Cov}(x_{\text{obs}}, y_{\text{obs}})}{\text{Var}(x_{\text{obs}})}=\frac{\beta\sigma^2_{\text{range}}}{\sigma^2_{\text{range}}+\sigma^2_x}=\beta\left(\frac{1}{1 +\left(\frac{\sigma_x}{\sigma_{\text{range}}}\right)^2}\right) \]

Define the attenuation factor $\lambda_\beta$ as:

\[ \lambda_\beta=\frac{1}{1+\left(\frac{\sigma_x}{\sigma_{\text{range}}}\right)^