Data subsampling for Poisson regression with pth-root-link

Han Cheng Lie, Alexander Munteanu
2024-10-30
Abstract: We develop and analyze data subsampling techniques for Poisson regression, the standard model for count data $y\in\mathbb{N}$. In particular, we consider the Poisson generalized linear model with ID- and square root-link functions. We consider the method of coresets, which are small weighted subsets that approximate the loss function of Poisson regression up to a factor of $1\pm\varepsilon$. We show $\Omega(n)$ lower bounds against coresets for Poisson regression that continue to hold against arbitrary data reduction techniques up to logarithmic factors. By introducing a novel complexity parameter and a domain shifting approach, we show that sublinear coresets with $1\pm\varepsilon$ approximation guarantee exist when the complexity parameter is small. In particular, the dependence on the number of input points can be reduced to polylogarithmic. We show that the dependence on other input parameters can also be bounded sublinearly, though not always logarithmically. In particular, we show that the square root-link admits an $O(\log(y_{\max}))$ dependence, where $y_{\max}$ denotes the largest count presented in the data, while the ID-link requires a $\Theta(\sqrt{y_{\max}/\log(y_{\max})})$ dependence. As an auxiliary result for proving the tightness of the bound with respect to $y_{\max}$ in the case of the ID-link, we show an improved bound on the principal branch of the Lambert $W_0$ function, which may be of independent interest. We further show the limitations of our analysis when $p$th degree root-link functions for $p\geq 3$ are considered, which indicate that other analytical or computational methods would be required if such a generalization is even possible.
Machine Learning, Data Structures and Algorithms
What problem does this paper attempt to address?
The main problem this paper addresses is the development and analysis of data subsampling techniques for Poisson regression, the standard generalized linear model (GLM) for count data \(y\in\mathbb{N}\). Specifically, the authors focus on Poisson regression with the identity link (ID-link) and the square root-link, and study the method of coresets. A coreset is a small weighted subset of the data that approximates the loss function of Poisson regression up to a factor of \(1\pm\varepsilon\).

### Main Problems and Challenges

1. **Lower Bound of Coreset Complexity**: The authors show an \(\Omega(n)\) lower bound against coresets for Poisson regression, indicating that replacing the log-link with the ID-link or the square root-link does not by itself avoid linear coreset complexity. This lower bound continues to hold against arbitrary data reduction techniques, up to logarithmic factors.
2. **Introducing a New Complexity Parameter**: To reduce the coreset size to sublinear, the authors introduce a new complexity parameter \(\rho\) and show, using a domain shifting approach, that sublinear coresets with a \(1\pm\varepsilon\) guarantee exist when \(\rho\) is small. The dependence on the number of input points \(n\) can then be reduced to polylogarithmic.
3. **The Influence of Different Link Functions**: For the square root-link, the authors prove an \(O(\log(y_{\max}))\) dependence on the largest count \(y_{\max}\) in the data, while the ID-link requires a \(\Theta(\sqrt{y_{\max}/\log(y_{\max})})\) dependence.
4. **Limitations of Higher-Order Root-Link Functions**: For \(p\)th-root link functions with \(p\geq 3\), the authors' analysis does not yield the required \(1\pm\varepsilon\) approximation, which indicates that other analytical or computational methods would be required, if such a generalization is possible at all.
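For concreteness, the objective that the coreset must approximate can be written out as follows. This is a sketch using the standard Poisson negative log-likelihood with a \(p\)th-root link, not a formula quoted from the paper; the symbols \(x_i\) (row of the design matrix), \(\beta\) (parameter vector), \(S\) (index set of the coreset) and \(w_i\) (weights) are notation assumed here, and the paper may normalize the loss differently.

\[
f(\beta) = \sum_{i=1}^{n} f_i(\beta), \qquad
f_i(\beta) = (x_i\beta)^p - p\,y_i\log(x_i\beta) + \log(y_i!), \qquad x_i\beta > 0,
\]

where \(p=1\) corresponds to the ID-link and \(p=2\) to the square root-link. Under this notation, a weighted subset \((S, w)\) is a \(1\pm\varepsilon\) coreset if

\[
\Bigl|\sum_{i\in S} w_i f_i(\beta) - f(\beta)\Bigr| \le \varepsilon\, f(\beta)
\quad \text{for all feasible } \beta .
\]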
### Solutions

- **Coreset Construction**: Using the sensitivity framework, the authors construct coresets by importance sampling, so that the weighted subsample approximates the original loss function up to a factor of \(1\pm\varepsilon\).
- **VC-Dimension Bounds**: The authors reduce the VC-dimension bound from quadratic to near-linear through grouping and rounding techniques, while maintaining a logarithmic dependence on the input parameters.
- **Sensitivity Bounds**: The new complexity parameter \(\rho\) allows the authors to control the sensitivities, which in turn reduces the size of the coreset.
- **Domain Shifting Approach**: To avoid regions of the parameter space in which points have high sensitivity, the authors introduce a domain shifting approach that restricts the optimization to solutions within a specific range.

### Summary

The main contribution of this paper is to provide the first rigorous theoretical analysis proving that, for the Poisson regression model, sublinear-sized coresets with a \(1\pm\varepsilon\) approximation guarantee can be achieved by introducing a new complexity parameter and a domain shifting approach. In addition, the authors explore the influence of different link functions on the coreset size and point out the limitations of their analysis for higher-order root-link functions.
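The importance-sampling idea behind the coreset construction can be illustrated with a minimal sketch. It is not the authors' algorithm: the sampling scores below (leverage scores of the design matrix plus a uniform term) are an assumed stand-in for the paper's sensitivity bounds, which depend on \(\rho\) and \(y_{\max}\). The function names `poisson_loss` and `sample_coreset` are hypothetical.

```python
from math import lgamma

import numpy as np


def poisson_loss(X, y, beta, p=2, weights=None):
    """Poisson negative log-likelihood with p-th root link, lambda_i = (x_i beta)^p."""
    z = X @ beta
    if np.any(z <= 0):
        raise ValueError("beta is outside the feasible domain: need X @ beta > 0")
    log_fact = np.array([lgamma(v + 1.0) for v in y])  # log(y_i!)
    per_point = z**p - p * y * np.log(z) + log_fact
    if weights is None:
        return float(per_point.sum())
    return float(weights @ per_point)


def sample_coreset(X, k, rng):
    """Draw k indices by importance sampling and return (indices, weights).

    ASSUMPTION: leverage scores plus a uniform term stand in for the
    sensitivity upper bounds; the paper's actual bounds are different.
    """
    n = X.shape[0]
    Q, _ = np.linalg.qr(X)                      # orthonormal basis of the column space
    scores = np.sum(Q**2, axis=1) + 1.0 / n     # leverage score + uniform term
    probs = scores / scores.sum()
    idx = rng.choice(n, size=k, replace=True, p=probs)
    weights = 1.0 / (k * probs[idx])            # makes the weighted loss unbiased
    return idx, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.abs(rng.normal(size=(5000, 3))) + 0.1   # positive features keep X @ beta > 0
    beta = np.array([0.5, 1.0, 0.8])
    y = rng.poisson((X @ beta) ** 2)               # square root-link: lambda = (x beta)^2
    idx, w = sample_coreset(X, 500, rng)
    full = poisson_loss(X, y, beta)
    approx = poisson_loss(X[idx], y[idx], beta, weights=w)
    print(f"relative error at beta: {abs(approx - full) / full:.3f}")
```

The reweighting by \(1/(k\,q_i)\), with \(q_i\) the sampling probability, makes the subsampled loss an unbiased estimate of the full loss at any fixed \(\beta\). The uniform \(1\pm\varepsilon\) guarantee over all feasible \(\beta\) is what the paper's sensitivity and VC-dimension analysis provides, and this sketch does not establish it.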