Abstract: A persistent challenge in astronomical machine learning is a systematic bias where predictions compress the dynamic range of true values -- high values are consistently predicted too low while low values are predicted too high. Understanding this bias has important consequences for astronomical measurements and our understanding of physical processes in astronomical inference. Through analytical examination of linear regression, we show that this bias arises naturally from measurement uncertainties in input features and persists regardless of training sample size, label accuracy, or parameter distribution. In the univariate case, we demonstrate that attenuation becomes important when the ratio of intrinsic signal range to measurement uncertainty ($\sigma_{\text{range}}/\sigma_x$) is below O(10) -- a regime common in astronomy. We further extend the theoretical framework to multivariate linear regression and demonstrate its implications using stellar spectroscopy as a case study. Even under optimal conditions -- high-resolution APOGEE-like spectra (R=24,000) with high signal-to-noise ratios (SNR=100) and multiple correlated features -- we find percent-level bias. The effect becomes even more severe for modern-day low-resolution surveys like LAMOST and DESI due to the lower SNR and resolution. These findings have broad implications, providing a theoretical framework for understanding and addressing this limitation in astronomical data analysis with machine learning.

What problem does this paper attempt to address?

This paper addresses the systematic misestimation of extreme values by machine-learning models in astronomy. Specifically, the authors are concerned that predictions compress the dynamic range of the true values: high values are systematically underestimated, while low values are overestimated. This bias affects astronomical measurements and the understanding of the underlying physical processes.

### Core of the problem

1. **Systematic bias**: Machine-learning models show a systematic bias in their predictions, with high values underestimated and low values overestimated.
2. **Cause exploration**: This bias is not caused by insufficient training data or sample imbalance, but by measurement uncertainty in the input features.
3. **Theoretical framework**: By analyzing the linear regression model, the authors show how this bias arises naturally from measurement uncertainty and is independent of training sample size, label accuracy, or parameter distribution.
4. **Multivariate extension**: The analysis is extended to multivariate linear regression and, using stellar spectroscopy as a case study, shows its impact in practical applications.

### Specific manifestations

- In the univariate case, the attenuation effect becomes significant when the ratio of the intrinsic signal range to the measurement uncertainty ($\frac{\sigma_{\text{range}}}{\sigma_x}$) is below $O(10)$, a regime that is common in astronomy.
- Even under optimal conditions (high-resolution APOGEE-like spectra with a resolution of $R = 24,000$ and a signal-to-noise ratio of $SNR = 100$), a percent-level bias remains.
- For modern low-resolution surveys (such as LAMOST and DESI), the effect is more severe because of the lower signal-to-noise ratio and resolution.

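To make the $O(10)$ threshold concrete, the univariate attenuation factor $\lambda_\beta = 1/(1 + (\sigma_x/\sigma_{\text{range}})^2)$ from the paper's linear-regression analysis can be evaluated at a few ratios. The following is a minimal sketch; the function name `attenuation_factor` and the chosen ratios are illustrative assumptions, not values taken from the paper.

```python
def attenuation_factor(range_to_noise: float) -> float:
    """Return lambda_beta = 1 / (1 + (sigma_x / sigma_range)^2).

    `range_to_noise` is the ratio sigma_range / sigma_x.
    """
    if range_to_noise <= 0:
        raise ValueError("range_to_noise must be positive")
    return 1.0 / (1.0 + 1.0 / range_to_noise**2)


# Illustrative ratios only (assumed for this sketch, not from the paper).
for ratio in (1.0, 3.0, 10.0, 30.0):
    lam = attenuation_factor(ratio)
    # 1 - lambda is the fractional amount by which the fitted slope shrinks.
    print(f"sigma_range/sigma_x = {ratio:5.1f}  "
          f"lambda = {lam:.4f}  slope shrinkage = {100 * (1 - lam):.2f}%")
```

At a ratio of 10 the slope shrinks by about 1%, and at a ratio of 3 by 10%, which is consistent with attenuation becoming important once the ratio falls below $O(10)$.
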
### Solution

The paper proposes a theoretical framework for understanding and addressing this problem, emphasizing that measurement uncertainty systematically attenuates regression coefficients. Through this framework, the authors aim to provide a theoretical basis and practical insights for machine-learning applications in astronomical data analysis.

### Mathematical expression

In univariate linear regression, assume that the observed values $y_{\text{obs}}$ and $x_{\text{obs}}$ are, respectively:

\[ y_{\text{obs}}=\beta x_{\text{true}}+\delta_y \]

\[ x_{\text{obs}}=x_{\text{true}}+\delta_x \]

where:

- $\beta$ is the slope,
- $\delta_y$ represents measurement uncertainty and intrinsic scatter, with $E[\delta_y] = 0$ and $\text{Var}(\delta_y)=\sigma^2_y$,
- $\delta_x$ represents measurement error, with $E[\delta_x] = 0$ and $\text{Var}(\delta_x)=\sigma^2_x$.

Writing $\sigma^2_{\text{range}}$ for the variance of $x_{\text{true}}$ and taking the errors to be uncorrelated with $x_{\text{true}}$ and with each other, the expected value of the regression coefficient $\hat{\beta}$ estimated by least squares is:

\[ E[\hat{\beta}]=\frac{\text{Cov}(x_{\text{obs}}, y_{\text{obs}})}{\text{Var}(x_{\text{obs}})}=\frac{\beta\sigma^2_{\text{range}}}{\sigma^2_{\text{range}}+\sigma^2_x}=\beta\left(\frac{1}{1 +\left(\frac{\sigma_x}{\sigma_{\text{range}}}\right)^2}\right) \]

Define the attenuation factor $\lambda_\beta$ as:

\[ \lambda_\beta=\frac{1}{1+\left(\frac{\sigma_x}{\sigma_{\text{range}}}\right)^2} \]
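
The attenuation of the fitted slope can be checked numerically. The sketch below draws data from the univariate model above and compares the least-squares slope with $\beta/(1+(\sigma_x/\sigma_{\text{range}})^2)$. The Gaussian distributions, the parameter values, and the helper names `ols_slope` and `simulate_slope` are assumptions made for illustration; they are not taken from the paper.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope: Cov(x, y) / Var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x


def simulate_slope(beta, sigma_range, sigma_x, sigma_y, n, seed=0):
    """Fit a slope to noisy (x_obs, y_obs) pairs drawn from the model."""
    rng = random.Random(seed)
    x_obs, y_obs = [], []
    for _ in range(n):
        # Assumed Gaussian x_true; the formula only depends on its variance.
        x_true = rng.gauss(0.0, sigma_range)
        x_obs.append(x_true + rng.gauss(0.0, sigma_x))         # noisy feature
        y_obs.append(beta * x_true + rng.gauss(0.0, sigma_y))  # noisy label
    return ols_slope(x_obs, y_obs)


# Hypothetical parameter values: sigma_range / sigma_x = 2.
beta, sigma_range, sigma_x = 1.0, 1.0, 0.5
expected = beta / (1.0 + (sigma_x / sigma_range) ** 2)  # analytic value: 0.8
fitted = simulate_slope(beta, sigma_range, sigma_x, sigma_y=0.1, n=200_000)
print(f"analytic attenuated slope: {expected:.3f}")
print(f"fitted slope from noisy x: {fitted:.3f}")
```

With these settings the fitted slope should land close to the analytic value of 0.8 rather than the true slope of 1, and increasing `n` tightens the scatter around 0.8 without moving it toward 1, in line with the claim that the bias persists regardless of training sample size.
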

Why Machine Learning Models Systematically Underestimate Extreme Values
