Abstract: We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.

What problem does this paper attempt to address?

The paper asks how training Fitted Q-Iteration (FQI) with log-loss can reduce the number of samples needed to learn a near-optimal policy in batch reinforcement learning (batch RL). It focuses on goal-oriented tasks in which the optimal policy reliably reaches the goal and the cost penalizes failure to reach it.

### Core problems of the paper

1. **Improvement of sample efficiency**: The paper proposes Fitted Q-Iteration with log-loss (FQI-LOG) and shows that its sample complexity scales with the cost of the optimal policy. In tasks where the optimal policy reliably reaches the goal and its cost is close to zero, FQI-LOG can therefore need far fewer samples than traditional FQI with squared loss (FQI-SQ).
2. **Theoretical guarantee**: The authors prove that FQI-LOG satisfies a small-cost bound in batch RL, meaning that the error bound scales with the cost of the optimal policy. No other efficient batch RL algorithm in previous work has achieved such a bound.
3. **Empirical verification**: Experiments confirm that FQI-LOG uses fewer samples than FQI-SQ on tasks where the optimal policy reliably reaches the goal.

### Specific problem description

In batch RL, the learner learns a policy from a fixed dataset and cannot interact further with the environment. The standard method is Fitted Q-Iteration (FQI), which approximates the optimal value function by iteratively regressing onto updated targets and then acts greedily with respect to the result.

Traditional FQI uses squared loss to measure the deviation between the predicted value and the target value, which can lead to high sample complexity. The paper points out that when the cost of the optimal policy is close to zero, FQI-LOG can significantly reduce the number of samples required. The reason is that log-loss penalizes a prediction that is far from the observed mean more heavily, especially when the observed mean is close to the boundary of the value range. FQI-LOG is therefore more inclined to fit observed values near the boundary accurately, which improves the stability of the learning process.

### Summary

The main contribution of the paper is FQI with log-loss (FQI-LOG), together with a proof that it satisfies a small-cost bound, that is, an error bound that scales with the cost of the optimal policy. Experiments also show that FQI-LOG is more sample-efficient than traditional FQI-SQ on tasks where the optimal policy reliably reaches the goal.
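To make the point about log-loss near the boundary concrete, here is a minimal sketch, not the paper's implementation. It assumes values and regression targets are normalized to the interval [0, 1], and the helper names (`squared_loss`, `log_loss`, `excess_loss`) are hypothetical. It compares how much each loss penalizes the same prediction error of 0.1 when the target sits at the boundary (0, as for a zero-cost optimal policy) and when it sits in the middle of the range (0.5).

```python
import math


def squared_loss(q, y):
    """Squared loss between a prediction q and a target y."""
    return (q - y) ** 2


def log_loss(q, y, eps=1e-12):
    """Log-loss (binary cross-entropy) with a soft target y in [0, 1].

    Assumption: predictions and targets are normalized to [0, 1].
    The prediction is clipped away from 0 and 1 to keep log() finite.
    """
    q = min(max(q, eps), 1.0 - eps)
    return -(y * math.log(q) + (1.0 - y) * math.log(1.0 - q))


def excess_loss(loss, q, y):
    """Loss of predicting q minus the loss of the best prediction q = y.

    Both losses above are minimized at q = y for y in [0, 1], so this
    measures the penalty paid for being wrong.
    """
    return loss(q, y) - loss(y, y)


# Same absolute error (0.1), two different target locations.
boundary_sq = excess_loss(squared_loss, 0.1, 0.0)
middle_sq = excess_loss(squared_loss, 0.6, 0.5)
boundary_log = excess_loss(log_loss, 0.1, 0.0)
middle_log = excess_loss(log_loss, 0.6, 0.5)

print("squared loss, target 0.0:", boundary_sq)
print("squared loss, target 0.5:", middle_sq)
print("log-loss,     target 0.0:", boundary_log)
print("log-loss,     target 0.5:", middle_log)
```

Squared loss charges the same penalty for the error in both cases, whereas log-loss charges several times more when the target is at the boundary. This is the sense in which log-loss is more sensitive to mistakes on near-zero-cost targets.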

Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning

DROP: Conservative Model-based Optimization for Offline Reinforcement Learning

Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning

Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning.

Mildly Conservative Q-Learning for Offline Reinforcement Learning

Budgeting Counterfactual for Offline RL

A Minimalist Approach to Offline Reinforcement Learning

Offline Reinforcement Learning with Implicit Q-Learning

Fast Rates for the Regret of Offline Reinforcement Learning

Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning

To Switch or Not to Switch? Balanced Policy Switching in Offline Reinforcement Learning

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Adaptable Conservative Q-Learning for Offline Reinforcement Learning.

Strategically Conservative Q-Learning

Improving Offline-to-Online Reinforcement Learning with Q Conditioned State Entropy Exploration

Offline Quantum Reinforcement Learning in a Conservative Manner

Adaptive pessimism via target Q-value for offline reinforcement learning

Offline RL with No OOD Actions: In-Sample Learning Via Implicit Value Regularization

Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

UDQL: Bridging The Gap between MSE Loss and The Optimal Value Function in Offline Reinforcement Learning