Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning

Alex Ayoub, Kaiwen Wang, Vincent Liu, Samuel Robertson, James McInerney, Dawen Liang, Nathan Kallus, Csaba Szepesvári
2024-08-02
Abstract: We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
Machine Learning
What problem does this paper attempt to address?
The paper asks how to reduce the number of samples needed to learn a near-optimal policy in batch reinforcement learning (batch RL), and answers by training Fitted Q-Iteration (FQI) with log-loss. Specifically, it focuses on sample efficiency in goal-oriented tasks where the optimal policy reliably achieves the goal and the cost penalizes failure to reach it.

### Core problems of the paper

1. **Improvement of sample efficiency**: The paper proposes a new algorithm, Fitted Q-Iteration with log-loss (FQI-LOG), and proves that its sample complexity scales with the cost of the optimal policy. In tasks where the optimal policy reliably achieves the goal and its cost is close to zero, FQI-LOG can therefore need significantly fewer samples than traditional FQI with squared loss (FQI-SQ).
2. **Theoretical guarantee**: The authors prove that FQI-LOG satisfies a small-cost bound in batch RL, that is, an error bound that scales with the cost of the optimal policy. According to the paper, no other efficient batch RL algorithm in previous work achieved such a bound.
3. **Empirical verification**: Experiments confirm that FQI-LOG is more sample-efficient than FQI-SQ on some tasks, especially those where the optimal policy reliably achieves the goal.

### Specific problem description

In batch RL, the learner learns a policy from a fixed dataset and cannot interact further with the environment. The standard method is Fitted Q-Iteration (FQI), which approximates the optimal policy by iteratively refitting the value function.
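
The FQI loop described above can be sketched on a toy problem as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The three-state chain, the `DATA` tuples, the sigmoid-parametrized table, the plain gradient-descent fit, and the names `fitted_q_iteration` and `greedy_policy` are all hypothetical choices made for this example. Costs are assumed to be normalized so that regression targets lie in [0, 1], which is what lets a log-loss be applied to them.

```python
import math

# Hypothetical toy dataset: tuples of (state, action, cost, next_state, done).
# Action 0 ("forward") moves toward the goal at zero cost; action 1 ("quit")
# ends the episode with cost 1. State 2 is the goal, so the optimal cost is 0.
DATA = [
    (0, 0, 0.0, 1, False),
    (0, 1, 1.0, 0, True),
    (1, 0, 0.0, 2, True),
    (1, 1, 1.0, 1, True),
]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fitted_q_iteration(data, n_states, n_actions, loss="log",
                       n_iters=5, n_grad_steps=500, lr=1.0):
    """Tabular FQI with Q(s, a) = sigmoid(theta[s][a]), so Q stays in (0, 1).

    loss="log" fits each regression with the log-loss (cross-entropy),
    loss="sq" fits it with the squared loss.
    """
    theta = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        q_prev = [[sigmoid(t) for t in row] for row in theta]
        # Bootstrapped targets: cost plus the smallest predicted cost-to-go.
        targets = []
        for s, a, c, s_next, done in data:
            boot = 0.0 if done else min(q_prev[s_next])
            targets.append(min(1.0, c + boot))
        # Refit the table from scratch by gradient descent on the chosen loss.
        theta = [[0.0] * n_actions for _ in range(n_states)]
        for _ in range(n_grad_steps):
            grad = [[0.0] * n_actions for _ in range(n_states)]
            for (s, a, _, _, _), y in zip(data, targets):
                q = sigmoid(theta[s][a])
                if loss == "log":
                    # d/dtheta of -[y log q + (1 - y) log(1 - q)]
                    grad[s][a] += q - y
                else:
                    # d/dtheta of (q - y)^2
                    grad[s][a] += 2.0 * (q - y) * q * (1.0 - q)
            for s in range(n_states):
                for a in range(n_actions):
                    theta[s][a] -= lr * grad[s][a] / len(data)
    return [[sigmoid(t) for t in row] for row in theta]


def greedy_policy(q):
    """Pick the action with the smallest predicted cost in every state."""
    return [min(range(len(row)), key=row.__getitem__) for row in q]


q_log = fitted_q_iteration(DATA, n_states=3, n_actions=2, loss="log")
q_sq = fitted_q_iteration(DATA, n_states=3, n_actions=2, loss="sq")
print("greedy (log-loss):", greedy_policy(q_log)[:2])
print("greedy (squared): ", greedy_policy(q_sq)[:2])
```

On this toy chain both variants end up preferring the "forward" action in the two non-terminal states. The example only shows where the loss enters the FQI loop; it says nothing about the sample-complexity gap, which the paper establishes by analysis and by its own experiments.
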
Traditional FQI measures the deviation between predicted and target values with squared loss, which can lead to high sample complexity in some cases. The paper points out that when the cost of the optimal policy is close to zero, FQI with log-loss (FQI-LOG) can significantly reduce the number of samples required. The reason given is that log-loss penalizes predictions far from the observed mean more heavily, especially when the observed mean is close to the boundary. FQI-LOG is therefore more inclined to fit observed values near the boundary, which improves the stability of the learning process.

### Summary

The main contribution of the paper is FQI with log-loss (FQI-LOG), together with a proof that it satisfies a small-cost bound, that is, an error bound that scales with the cost of the optimal policy. The experiments additionally show that FQI-LOG is more sample-efficient than traditional FQI-SQ on some tasks, especially those where the optimal policy reliably achieves the goal.
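
As a small worked illustration of the boundary argument in the problem description, the two losses can be compared for a target sitting at the boundary. The notation below is ours, and it assumes predictions $\hat{y}$ and targets $y$ are normalized to $[0, 1]$; the numbers are plain arithmetic, not results from the paper.

```latex
\ell_{\mathrm{sq}}(\hat{y}, y) = (\hat{y} - y)^2,
\qquad
\ell_{\log}(\hat{y}, y) = -\,y \ln \hat{y} \;-\; (1 - y)\ln(1 - \hat{y}).

% Target at the boundary, y = 0:
\ell_{\mathrm{sq}}(\hat{y}, 0) = \hat{y}^2,
\qquad
\ell_{\log}(\hat{y}, 0) = -\ln(1 - \hat{y}).

\begin{array}{c|c|c}
\hat{y} & \ell_{\mathrm{sq}}(\hat{y}, 0) & \ell_{\log}(\hat{y}, 0) \\ \hline
0.1 & 0.01 & \approx 0.105 \\
0.5 & 0.25 & \approx 0.693 \\
0.9 & 0.81 & \approx 2.303
\end{array}
```

For small errors near the boundary, $-\ln(1 - \hat{y}) \approx \hat{y}$, so the log-loss grows linearly in the error where the squared loss grows only quadratically, and it is unbounded as $\hat{y} \to 1$ while the squared loss never exceeds 1.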