Abstract: Online decision making aims to learn the optimal decision rule by making personalized decisions and updating the decision rule recursively. It has become easier than before with the help of big data, but new challenges also come along. Since the decision rule should be updated once per step, an offline update which uses all the historical data is inefficient in computation and storage. To this end, we propose a completely online algorithm that can make decisions and update the decision rule online via stochastic gradient descent. It is not only efficient but also supports all kinds of parametric reward models. Focusing on the statistical inference of online decision making, we establish the asymptotic normality of the parameter estimator produced by our algorithm and the online inverse probability weighted value estimator we used to estimate the optimal value. Online plugin estimators for the variance of the parameter and value estimators are also provided and shown to be consistent, so that interval estimation and hypothesis testing are possible using our method. The proposed algorithm and theoretical results are tested by simulations and a real data application to news article recommendation.
What problem does this paper attempt to address?
This paper addresses several key problems in online decision-making, in particular how to learn optimal decision rules efficiently in a big-data setting while also conducting statistical inference. Specifically, it focuses on the following points:
1. **Efficient online update**: Traditional offline updates use all historical data to refit the decision rule at every step, which is inefficient in both computation and storage. The paper proposes a fully online algorithm that updates the decision rule in real time via stochastic gradient descent (SGD).
2. **Statistical inference**: Online decision-making must not only find the optimal decision rule but also quantify the uncertainty of that rule and of the average reward it can achieve. The paper establishes the asymptotic normality of the parameter and value estimators and proposes online plug-in estimators of their variances, making interval estimation and hypothesis testing possible.
3. **Balance between exploration and exploitation**: In online decision-making, balancing the exploration of new actions against the exploitation of the best known action is an important issue. The paper adopts the ε-greedy method: with a small probability it chooses an action currently estimated to be sub-optimal in order to explore, and with a large probability it chooses the currently estimated optimal action in order to exploit what is known.
4. **Wide applicability**: The proposed algorithm supports general parametric reward models, not just linear ones. It can therefore be applied to more complex practical problems, such as news recommendation, precision medicine, and dynamic pricing.
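The ε-greedy rule in point 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `epsilon_greedy`, the uniform choice among all actions during exploration, and the returned selection probability are assumptions made here for clarity.

```python
import random

def epsilon_greedy(estimated_rewards, epsilon, rng=random):
    """Pick an action index and return it with its selection probability.

    Assumption (not from the paper): exploration draws uniformly from all
    actions, so the greedy action can also be drawn while exploring.
    """
    k = len(estimated_rewards)
    # Action with the highest estimated reward under the current rule.
    greedy = max(range(k), key=lambda a: estimated_rewards[a])
    if rng.random() < epsilon:
        action = rng.randrange(k)   # explore
    else:
        action = greedy             # exploit
    # Probability with which this policy selects `action`.
    prob = epsilon / k + (1.0 - epsilon if action == greedy else 0.0)
    return action, prob

# Example: two actions, 20% exploration.
action, prob = epsilon_greedy([0.1, 0.9], epsilon=0.2, rng=random.Random(0))
```

Returning the selection probability alongside the action is a design choice of this sketch: an inverse-probability-weighted update needs that probability later, and it is cheapest to record it at decision time.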
### Main contributions of the paper
1. **A fully online decision-making algorithm**: The algorithm is based on SGD and modifies the gradient through inverse probability weighting (IPW) to estimate the decision rule and the expected reward online.
2. **Statistical inference results for decision rules and expected rewards**: The paper proves the asymptotic normality of the parameter and value estimators and provides online plug-in estimators of their variances, making statistical inference possible.
3. **Improved computational and storage efficiency**: The algorithm does not store the historical data; it needs only O(p²) storage, and it improves computational efficiency by updating the second-moment and Hessian matrices online.
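To illustrate why O(p²) storage can suffice, the sketch below keeps a running average of outer products of p-dimensional vectors without retaining any past vector. It is a generic recursion given under assumptions: the function name `update_running_outer` is hypothetical, and the paper's actual second-moment and Hessian updates may weight or construct the terms differently.

```python
def update_running_outer(mean_outer, vec, t):
    """Update, in place, the running average of vec vec^T.

    mean_outer: p x p list of lists holding the average over the first t-1 vectors
    vec:        the t-th vector (length p), for example a per-step gradient
    t:          1-based count of vectors seen so far, including `vec`
    """
    p = len(vec)
    for i in range(p):
        for j in range(p):
            # new_mean = old_mean + (new_value - old_mean) / t
            mean_outer[i][j] += (vec[i] * vec[j] - mean_outer[i][j]) / t
    return mean_outer

# Only the p x p matrix is stored, regardless of how many steps have passed.
S = [[0.0, 0.0], [0.0, 0.0]]
S = update_running_outer(S, [1.0, 2.0], t=1)
S = update_running_outer(S, [3.0, 0.0], t=2)
```

Each update costs O(p²) time and memory, independent of the number of decision steps already taken.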
### Method overview
- **Online decision-making and ε-greedy strategy**: At each decision point, the algorithm estimates the optimal decision rule from the current features and the historical data, then selects an action with the ε-greedy strategy to balance exploration and exploitation.
- **SGD and IPW gradient**: The parameter estimator is updated by SGD. The IPW gradient reweights each sample so that the sampling distribution resembles the one induced by a fixed decision rule, which restores the martingale structure and ensures asymptotic normality.
- **Statistical inference**: The paper derives the asymptotic distribution of the parameter estimator and proposes an online plug-in estimator of its variance, thus supporting interval estimation and hypothesis testing.
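The SGD step with an IPW gradient described above might look like the following sketch. It assumes one linear reward model per action, a squared loss, and a weight of 1/π, where π is the probability with which the chosen action was selected. The name `ipw_sgd_step` and these modelling choices are illustrative assumptions; the paper's exact weighting, step-size schedule, and any averaging are not reproduced here.

```python
def ipw_sgd_step(theta, x, action, reward, prob, step_size):
    """One SGD step with an inverse-probability-weighted gradient.

    theta:     list of per-action coefficient lists (assumed linear model)
    x:         feature vector observed at this step
    action:    index of the action that was taken
    reward:    observed reward for that action
    prob:      probability with which the policy chose `action` (must be > 0)
    step_size: learning rate for this step
    """
    # Predicted reward for the chosen action under the current parameters.
    pred = sum(t * xi for t, xi in zip(theta[action], x))
    # The IPW weight up-weights actions that were unlikely to be chosen.
    weight = 1.0 / prob
    # Gradient of 0.5 * (pred - reward)^2 with respect to theta[action].
    grad = [weight * (pred - reward) * xi for xi in x]
    theta[action] = [t - step_size * g for t, g in zip(theta[action], grad)]
    return theta

# Example: two actions, two features, action 1 chosen with probability 0.5.
theta = [[0.0, 0.0], [0.0, 0.0]]
theta = ipw_sgd_step(theta, x=[1.0, 2.0], action=1, reward=1.0,
                     prob=0.5, step_size=0.1)
```

Only the parameters of the chosen action move in this sketch, because the reward of the other action is never observed at that step.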
### Application examples
- **Linear reward model**: Assuming that the conditional mean reward function is linear, the paper shows how to use the quadratic loss function for parameter estimation and verifies the relevant assumptions.
- **Logistic regression reward model**: When the outcome is binary, such as click or no click in news recommendation, the paper shows how to use the cross-entropy loss function for parameter estimation and verifies the relevant assumptions.
In conclusion, this paper provides an efficient online decision-making method with statistical inference capabilities, which is suitable for a variety of practical application scenarios.
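The two reward models differ only in the per-sample loss gradient fed to SGD. The sketch below gives the standard gradients of the quadratic loss and the cross-entropy loss for a single observation. The function names are hypothetical, and the gradients are shown unweighted for a single parameter vector.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def squared_loss_grad(theta, x, y):
    """Gradient of 0.5 * (theta.x - y)^2 with respect to theta (linear model)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return [(z - y) * xi for xi in x]

def cross_entropy_grad(theta, x, y):
    """Gradient of -[y log s + (1 - y) log(1 - s)], s = sigmoid(theta.x), y in {0, 1}."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return [(sigmoid(z) - y) * xi for xi in x]

# Example: a click (y = 1) observed with features x = [1, 2].
g_linear = squared_loss_grad([1.0, 1.0], [1.0, 2.0], 1.0)
g_logistic = cross_entropy_grad([0.0, 0.0], [1.0, 2.0], 1)
```

Both gradients have the form (prediction minus outcome) times the features, which is why the same SGD machinery can serve either model.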