Regularization of Soft Actor-Critic Algorithms with Automatic Temperature Adjustment

Ben You
2023-05-23
Abstract: This work presents a comprehensive analysis to regularize the Soft Actor-Critic (SAC) algorithm with automatic temperature adjustment. The policy evaluation, the policy improvement, and the temperature adjustment are reformulated, introducing certain modifications and stating the original theory more clearly and explicitly.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to strengthen the theoretical basis and implementation details of the automatic temperature adjustment mechanism in the Soft Actor-Critic (SAC) algorithm. Specifically, the author aims to:

1. **Clarify the ambiguity in theoretical derivations**: Once automatic temperature adjustment is introduced, some theoretical derivations of the SAC algorithm become ambiguous or inconsistent, especially the recursive definition of the soft Q-function. The author resolves these ambiguities and corrects the defects in the original paper by re-deriving the Bellman equation, the policy improvement step, and the temperature adjustment step.
2. **Revise the recursive definition of the soft Q-function**: The author points out that the recursive definition of the soft Q-function in the original literature lacks a crucial term \(-\alpha H_0\), which may lead to over-exploration or under-exploration during policy evaluation. The author therefore re-derives the recursive definition so that the expression is correct and complete.
3. **Define the optimization problem of policy improvement clearly**: The author emphasizes that policy improvement should be based on the optimization problem (Equation (1)), rather than on an arbitrary information projection. The derivation of policy improvement must therefore follow this specific mathematical form to be justified.
4. **Incorporate the influence of state expectations**: In both policy improvement and automatic temperature adjustment, the expectation over states must be taken into account. This point may have been overlooked in the original literature; the author shows through detailed derivation why it matters and how to apply it correctly in the algorithm implementation.
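To make point 2 concrete, the following minimal sketch computes a one-step soft Q target for a single next state with a discrete action space, once with and once without the \(-\alpha H_0\) term. The function name `soft_q_target`, the discrete-action setting, and all numeric values are illustrative assumptions rather than anything taken from the paper; the target follows the recursive form summarized here, which has no discount factor.

```python
import math


def soft_q_target(reward, next_q, next_probs, alpha, target_entropy, include_h0=True):
    """One-step soft Q target for one known next state with discrete actions.

    Computes r + sum_a pi(a|s') * (Q(s', a) - alpha * log pi(a|s') - alpha * H0).
    When include_h0 is False, the alpha * H0 term is dropped.
    """
    expectation = 0.0
    for q, p in zip(next_q, next_probs):
        if p == 0.0:
            continue  # a zero-probability action contributes nothing
        bonus = -alpha * math.log(p)
        if include_h0:
            bonus -= alpha * target_entropy
        expectation += p * (q + bonus)
    return reward + expectation


# Hypothetical toy values, chosen only for illustration.
reward = 1.0
next_q = [0.5, 1.5]       # Q(s', a) for two actions
next_probs = [0.5, 0.5]   # pi(a | s')
alpha = 0.2               # temperature
target_entropy = 0.3      # H0

with_h0 = soft_q_target(reward, next_q, next_probs, alpha, target_entropy)
without_h0 = soft_q_target(reward, next_q, next_probs, alpha, target_entropy,
                           include_h0=False)

# Because the action probabilities sum to 1, the two targets differ by alpha * H0.
gap = without_h0 - with_h0
print(with_h0, without_h0, gap)
```

In this toy setting, omitting the term shifts every target by the same amount \(\alpha H_0\). The shift stops being a constant once the temperature changes between steps, which is the case automatic temperature adjustment introduces.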
### Formula Summary

- **Recursive definition of the soft Q-function**:
\[ Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}\sim p,\, a_{t+1}\sim\pi_{t+1}}\left[Q(s_{t+1}, a_{t+1}) - \alpha_{t+1}\log\pi_{t+1}(a_{t+1}|s_{t+1}) - \alpha_{t+1}H_0\right] \]
- **Optimization problem of policy improvement**:
\[ \pi^*_t = \arg\max_{\pi_t}\mathbb{E}_{(s_t, a_t)\sim\rho_{\pi_t}}\left[Q(s_t, a_t) - \alpha_t\left(\log\pi_t(a_t|s_t) + H_0\right)\right] \]
- **Optimization problem of automatic temperature adjustment**:
\[ \alpha^*_t = \arg\min_{\alpha_t\geq 0}\alpha_t\left\{\mathbb{E}_{s_t\sim\rho_{\pi^*_t},\, a_t\sim\pi^*_t}\left[-\log\pi^*_t(a_t|s_t) - H_0\right]\right\} \]

Through these improvements, the author hopes to provide a more rigorous and reliable theoretical basis for the SAC algorithm, thereby enhancing its performance and stability in practical applications.
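The temperature objective above is linear in \(\alpha_t\), so its gradient with respect to \(\alpha_t\) is the state-averaged policy entropy minus \(H_0\). The sketch below illustrates this for discrete policies and shows why the expectation over states matters. It is not the paper's implementation: the helper names, the learning rate, and the projected gradient step on \(\alpha\) itself (many practical implementations optimize \(\log\alpha\) instead) are all assumptions made for illustration.

```python
import math


def policy_entropy(probs):
    """Entropy E_a[-log pi(a|s)] of one discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def temperature_gradient(state_policies, target_entropy):
    """Gradient w.r.t. alpha of alpha * E_s[ E_a[-log pi(a|s)] - H0 ].

    The expectation over states is approximated by the mean over the batch.
    """
    mean_entropy = sum(policy_entropy(p) for p in state_policies) / len(state_policies)
    return mean_entropy - target_entropy


def temperature_step(alpha, state_policies, target_entropy, lr=0.1):
    """One projected gradient-descent step that keeps alpha >= 0 (hypothetical lr)."""
    grad = temperature_gradient(state_policies, target_entropy)
    return max(0.0, alpha - lr * grad)


# Hypothetical batch: the policy's action distribution at two sampled states.
state_policies = [[0.5, 0.5], [0.9, 0.1]]
alpha = 0.2

# Mean entropy is below H0 = 0.6, so alpha increases to encourage exploration.
alpha_up = temperature_step(alpha, state_policies, target_entropy=0.6)

# Mean entropy is above H0 = 0.3, so alpha decreases.
alpha_down = temperature_step(alpha, state_policies, target_entropy=0.3)

# Using only the first state would flip the sign of the gradient for H0 = 0.6.
single_state_grad = temperature_gradient([[0.5, 0.5]], target_entropy=0.6)
batch_grad = temperature_gradient(state_policies, target_entropy=0.6)
print(alpha_up, alpha_down, single_state_grad, batch_grad)
```

The last two values have opposite signs. A temperature update computed from a single state can therefore move \(\alpha\) in the wrong direction relative to the state-averaged objective, which is the concern raised about state expectations.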