Multi-Style Distributional Soft Actor-Critic: Learning a Unified Policy for Diverse Control Behaviors
Liming Xiao, Yao Lyu, Fawang Zhang, Liangfa Chen, Guangyuan Yu, Shengbo Eben Li, Fei Ma, Jingliang Duan
DOI: https://doi.org/10.1109/tiv.2024.3432891
IF: 8.2
2024-01-01
IEEE Transactions on Intelligent Vehicles
Abstract: Reinforcement learning (RL) has excelled in sequential decision-making and control tasks, yet traditional RL algorithms adhere to a single control style in identical scenarios and therefore cannot accommodate varied control preferences. Existing multi-style RL methods typically require customized reward or objective functions tailored to each control style, which may not be feasible when diverse driving styles are necessary. To overcome these limitations, we propose the multi-style distributional soft actor-critic (M-DSAC) algorithm, which learns a single policy that supports multiple control behaviors. We begin by developing a multi-style policy iteration (MPI) framework that learns the entire distribution of returns, known as the value distribution, rather than only the expected return (i.e., the $Q$ value). In this framework, we use the quantile index of the value distribution as a style indicator and augment the inputs of both the policy and its corresponding value distribution with this quantile index. Building upon the MPI framework, the M-DSAC algorithm employs a parameterized diagonal Gaussian function to approximate the value distribution. This approach enables efficient computation of different value quantiles by combining the value distribution's mean and standard deviation with appropriate coefficients. By optimizing the policy across different quantiles, M-DSAC efficiently learns a versatile policy that handles a range of control styles without significant computational cost. Experimental evaluations on MuJoCo benchmarks and real-world robot control tasks confirm the effectiveness of M-DSAC and indicate its broad practical applicability.
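The abstract notes that, under a Gaussian approximation of the value distribution, any value quantile can be obtained by combining the mean and standard deviation with an appropriate coefficient. The following is a minimal sketch of that general identity, not the authors' implementation: for a Gaussian, the quantile at index tau equals mean + z(tau) * std, where z(tau) is the standard normal inverse CDF. The function names `gaussian_quantile` and `pick_by_style`, the candidate labels, and all numbers are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def gaussian_quantile(mean: float, std: float, tau: float) -> float:
    """Return the tau-quantile of a Gaussian N(mean, std^2).

    Uses the standard identity q(tau) = mean + z(tau) * std, where z(tau)
    is the inverse CDF of the standard normal distribution.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if std < 0.0:
        raise ValueError("std must be non-negative")
    z = NormalDist().inv_cdf(tau)  # coefficient that depends only on tau
    return mean + z * std


def pick_by_style(candidates: dict, tau: float) -> str:
    """Pick the candidate whose return distribution has the largest tau-quantile.

    `candidates` maps a label to a hypothetical (mean, std) pair describing
    a Gaussian estimate of the return for that candidate behavior.
    """
    return max(candidates, key=lambda name: gaussian_quantile(*candidates[name], tau))


# Hypothetical return estimates: one low-variance option, one high-variance option.
candidates = {
    "steady": (10.0, 1.0),
    "bold": (11.0, 4.0),
}

for tau in (0.1, 0.5, 0.9):
    print(tau, pick_by_style(candidates, tau))
```

In this toy setting a low quantile index favors the low-variance option, while a high index favors the option with the higher mean and larger spread, which illustrates how a single quantile index could act as a style indicator. Because the coefficient z(tau) depends only on tau, evaluating a new quantile requires no additional statistics beyond the mean and standard deviation.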