Soft Actor-Critic for Discrete Action Settings

Petros Christodoulou
DOI: https://doi.org/10.48550/arXiv.1910.07207
2019-10-18
Abstract: Soft Actor-Critic is a state-of-the-art reinforcement learning algorithm for continuous action settings that is not applicable to discrete action settings. Many important settings involve discrete actions, however, and so here we derive an alternative version of the Soft Actor-Critic algorithm that is applicable to discrete action settings. We then show that, even without any hyperparameter tuning, it is competitive with the tuned model-free state-of-the-art on a selection of games from the Atari suite.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper extends the Soft Actor-Critic (SAC) algorithm from continuous action spaces to discrete action spaces. SAC performs very well in continuous action settings but cannot be applied directly when actions are discrete. Many practical settings involve discrete actions (Atari games, for example), so a version of SAC suited to discrete actions is needed.

### Main contributions of the paper:

1. **Extending SAC to discrete action spaces**: The paper derives a version of SAC (called SAC-Discrete) that applies to discrete-action environments, so SAC can be used on a wider range of reinforcement learning tasks.
2. **Performance verification**: Experiments on the Atari game suite show that SAC-Discrete is competitive with tuned, state-of-the-art model-free algorithms (such as Rainbow), even without hyperparameter tuning.

### Specific problem description:

- **Existing problems**: The original Soft Actor-Critic algorithm only handles tasks with continuous action spaces, while many practical applications (such as Atari games) have discrete action spaces.
- **Solutions**: The paper modifies the key steps of SAC so that it can handle discrete action spaces. These modifications include:
  - The Q-function outputs the Q-value of every possible action, rather than the Q-value of a single action supplied as input.
  - The policy network directly outputs a probability distribution over actions, rather than the mean and covariance of an action distribution.
  - The expectations in the soft state-value function and the temperature loss are computed exactly by summing over actions, so no Monte Carlo estimate is needed.
  - The reparameterization trick is no longer required to minimize the policy objective, for the same reason: the expectation over actions is computed exactly.

### Experimental results:

- The paper ran experiments on 20 Atari games.
- SAC-Discrete scores higher than Rainbow in 10 of the 20 games, and its median performance across all games is close to Rainbow's. The per-game difference relative to Rainbow ranges from -99% to +4330%.

### Conclusions:

The paper extends SAC to discrete action spaces. The Atari experiments show that, even without hyperparameter tuning, SAC-Discrete is competitive with state-of-the-art model-free algorithms in terms of sample efficiency. This makes SAC applicable to a wider range of tasks.

### Related formulas:

1. **Maximum entropy objective**:
\[ \pi^*=\arg\max_{\pi}\sum_{t = 0}^{T}\mathbb{E}_{(s_t,a_t)\sim\tau^\pi}[\gamma^t(r(s_t,a_t)+\alpha H(\pi(\cdot|s_t)))] \]
where \(H(\pi(\cdot|s_t)) = -\mathbb{E}_{a\sim\pi(\cdot|s_t)}[\log\pi(a|s_t)]\) is the entropy of the policy at state \(s_t\) and \(\alpha\) is the temperature that weights the entropy term against the reward.
2. **Soft state-value function** (discrete-action version):
\[ V(s_t):=\pi(s_t)^T[Q(s_t)-\alpha\log(\pi(s_t))] \]
3. **Temperature loss** (discrete-action version):
\[ J(\alpha)=\pi_t(s_t)^T[-\alpha(\log(\pi_t(s_t))+\bar{H})] \]
where \(\bar{H}\) is the target entropy.
4. **Policy objective function** (discrete-action version):
\[ J_\pi(\phi)=\mathbb{E}_{s_t\sim D}[\pi_t(s_t)^T[\alpha\log(\pi_\phi(s_t))-Q_\theta(s_t)]] \]

With these modifications, SAC-Discrete handles discrete action spaces.
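
The discrete-action formulas above can be sketched in NumPy to show how the expectation over actions becomes an exact dot product between the policy's probability vector and a per-action vector. This is a minimal illustration, not the paper's implementation: the function names, the `EPS` stabilizer, and the example values of `alpha` and `target_entropy` are assumptions made here, and the code only evaluates the quantities (a real training loop would differentiate them with an autodiff framework).

```python
import numpy as np

EPS = 1e-8  # assumed small constant to avoid log(0); not specified in the summary


def softmax(logits):
    """Turn policy logits of shape (batch, n_actions) into probabilities."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def soft_state_value(probs, q_values, alpha):
    """V(s) = pi(s)^T [Q(s) - alpha * log pi(s)], one value per batch row."""
    log_probs = np.log(probs + EPS)
    return (probs * (q_values - alpha * log_probs)).sum(axis=-1)


def temperature_loss(probs, alpha, target_entropy):
    """J(alpha) = pi(s)^T [-alpha * (log pi(s) + H_bar)], averaged over the batch."""
    log_probs = np.log(probs + EPS)
    per_state = (probs * (-alpha * (log_probs + target_entropy))).sum(axis=-1)
    return per_state.mean()


def policy_loss(probs, q_values, alpha):
    """J_pi = E_s[ pi(s)^T (alpha * log pi(s) - Q(s)) ], averaged over the batch."""
    log_probs = np.log(probs + EPS)
    return (probs * (alpha * log_probs - q_values)).sum(axis=-1).mean()


# Hypothetical batch: 2 states, 3 actions, made-up logits and Q-values.
example_probs = softmax(np.array([[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]))
example_q = np.array([[1.0, 0.5, 2.0], [0.0, 0.0, 3.0]])
print(soft_state_value(example_probs, example_q, alpha=0.2))
print(policy_loss(example_probs, example_q, alpha=0.2))
print(temperature_loss(example_probs, alpha=0.2, target_entropy=0.5))
```

Because every action's probability and Q-value are available at once, no action needs to be sampled inside these functions, which is why neither a Monte Carlo estimate nor the reparameterization trick appears. Note also that, for the same inputs, the policy objective is the negative of the batch-averaged soft state value.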