What problem does this paper attempt to address?

The paper aims to extend the Soft Actor-Critic (SAC) algorithm from continuous action spaces to discrete action spaces. SAC performs very well in continuous-action settings but cannot be applied directly when actions are discrete. Because many practical scenarios involve discrete actions (Atari games, for example), a discrete-action version of SAC is needed.

### Main contributions of the paper:

1. **Extending SAC to discrete action spaces**: The paper derives a version of SAC, called SAC-Discrete, that applies to discrete-action environments, so SAC can be used on more types of reinforcement learning tasks.
2. **Performance verification**: Experiments on the Atari game suite show that SAC-Discrete is competitive with state-of-the-art model-free algorithms such as Rainbow, even without hyperparameter tuning.

### Specific problem description:

- **Existing problems**: The original SAC algorithm only handles continuous action spaces, while many practical applications (such as Atari games) involve discrete action spaces.
- **Solutions**: The paper modifies the key steps of SAC so that it can handle discrete action spaces:
  - The Q-function outputs the Q-value of every possible action rather than only that of the input action.
  - The policy network directly outputs a probability distribution over actions instead of a mean and covariance.
  - The expectations in the soft state-value function and the temperature loss are computed exactly, without Monte Carlo estimation.
  - The reparameterization trick is no longer needed to minimize the policy objective.

### Experimental results:

- The paper ran experiments on 20 Atari games.
- SAC-Discrete outperforms Rainbow in 10 of the 20 games, and its median performance across all games is close to Rainbow's. Relative to Rainbow, the largest improvement is +4330% and the largest drop is -99%.

### Conclusions:

The paper extends SAC to discrete action spaces. The Atari experiments show that, even without hyperparameter tuning, SAC-Discrete competes with state-of-the-art model-free algorithms in terms of sample efficiency. This makes SAC applicable to more types of tasks.

### Related formulas:

1. **Maximum entropy objective**:
   \[ \pi^*=\arg\max_{\pi}\sum_{t = 0}^{T}\mathbb{E}_{(s_t,a_t)\sim\tau^\pi}[\gamma^t(r(s_t,a_t)+\alpha H(\pi(\cdot|s_t)))] \]
   where \(H(\pi(\cdot|s_t)) = -\mathbb{E}_{a\sim\pi(\cdot|s_t)}[\log\pi(a|s_t)]\) is the entropy of the policy at state \(s_t\) and \(\alpha\) is the temperature that weights the entropy term against the reward.
2. **Soft state-value function** (discrete-action version):
   \[ V(s_t):=\pi(s_t)^T[Q(s_t)-\alpha\log(\pi(s_t))] \]
3. **Temperature loss** (discrete-action version):
   \[ J(\alpha)=\pi_t(s_t)^T[-\alpha(\log(\pi_t(s_t))+\bar{H})] \]
   where \(\bar{H}\) is the target entropy.
4. **Policy objective function** (discrete-action version):
   \[ J_\pi(\phi)=\mathbb{E}_{s_t\sim D}[\pi_t(s_t)^T[\alpha\log(\pi_\phi(s_t))-Q_\theta(s_t)]] \]

With these modifications, SAC-Discrete handles discrete action spaces.
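The discrete-action formulas above can be checked numerically for a single state. The following is a minimal sketch in plain Python, not the paper's implementation: the function names, the three-action example, and all numeric values (including the target entropy) are hypothetical and chosen only for illustration. It assumes every action probability is strictly positive so that the logarithm is defined.

```python
import math


def softmax(logits):
    """Turn raw policy logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_state_value(probs, q_values, alpha):
    """V(s) = pi(s)^T [Q(s) - alpha * log pi(s)], an exact sum over actions."""
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(probs, q_values))


def temperature_loss(probs, alpha, target_entropy):
    """J(alpha) = pi(s)^T [-alpha * (log pi(s) + H_bar)] for one state."""
    return sum(p * (-alpha * (math.log(p) + target_entropy)) for p in probs)


def policy_loss(probs, q_values, alpha):
    """Per-state term of J_pi: pi(s)^T [alpha * log pi(s) - Q(s)]."""
    return sum(p * (alpha * math.log(p) - q) for p, q in zip(probs, q_values))


# Hypothetical single-state example with three discrete actions.
logits = [0.0, 0.0, 0.0]   # equal logits give a uniform policy
q_values = [1.0, 2.0, 3.0]  # one Q-value per action, as the Q-network would output
alpha = 0.5                 # temperature (illustrative value)
target_entropy = 1.0        # H_bar (illustrative value)

probs = softmax(logits)
v = soft_state_value(probs, q_values, alpha)
j_alpha = temperature_loss(probs, alpha, target_entropy)
j_pi = policy_loss(probs, q_values, alpha)

print("V(s)     =", v)
print("J(alpha) =", j_alpha)
print("J_pi     =", j_pi)
```

Because every quantity is a finite sum over the action set, no sampling and no reparameterization is involved. When the same policy, Q-values, and temperature are used, the per-state policy loss is the negative of the soft state value, which is a quick sanity check for an implementation.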

Soft Actor-Critic for Discrete Action Settings
