Multi-Timescale Ensemble Q-learning for Markov Decision Process Policy Optimization

Talha Bozkus, Urbashi Mitra

2024-02-08

Abstract: Reinforcement learning (RL) is a classical tool to solve network control or policy optimization problems in unknown environments. The original Q-learning suffers from performance and complexity challenges across very large networks. Herein, a novel model-free ensemble reinforcement learning algorithm which adapts the classical Q-learning is proposed to handle these challenges for networks which admit Markov decision process (MDP) models. Multiple Q-learning algorithms are run on multiple, distinct, synthetically created and structurally related Markovian environments in parallel; the outputs are fused using an adaptive weighting mechanism based on the Jensen-Shannon divergence (JSD) to obtain an approximately optimal policy with low complexity. The theoretical justification of the algorithm, including the convergence of key statistics and Q-functions, is provided. Numerical results across several network models show that the proposed algorithm can achieve up to 55% less average policy error with up to 50% less runtime complexity than the state-of-the-art Q-learning algorithms. Numerical results also validate the assumptions made in the theoretical analysis.

Machine Learning, Signal Processing

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses key challenges that reinforcement learning algorithms, specifically Q-learning, face in large-scale Markov decision process (MDP) environments:

1. **High Complexity and Performance Issues**: Traditional Q-learning suffers from high estimation bias, high estimation variance, training instability, slow convergence, and high sample complexity when applied to large-scale networks.
2. **Exploration Efficiency**: Efficient and scalable exploration in large Markov environments remains a significant challenge. Too much or too little exploration can lead to suboptimal policies or excessive computational cost.

To address these issues, the authors propose a novel multi-timescale ensemble Q-learning algorithm. It runs multiple Q-learning algorithms in parallel on several synthetic, structurally related Markovian environments and fuses their outputs with an adaptive weighting mechanism based on the Jensen-Shannon divergence (JSD), yielding an approximately optimal policy with low complexity. Numerical results show that the method achieves up to 55% lower average policy error and up to 50% lower runtime complexity than state-of-the-art Q-learning algorithms.
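The fusion idea described above can be illustrated with a small sketch. This is not the authors' algorithm: the summary does not say how the JSD enters the weights, so the rule used here is an assumption made purely for illustration. Each Q-table is turned into per-state softmax action distributions, its mean JSD against the first table (treated as the reference environment) is computed, and its weight is proportional to `exp(-JSD / temperature)`. All function names, the `temperature` parameter, and the toy Q-tables are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; zero-mass terms of p are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2 when using natural logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def softmax(values):
    """Numerically stable softmax over one state's Q-values."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def table_divergence(q_ref, q_other):
    """Mean per-state JSD between the softmax policies of two Q-tables."""
    per_state = [js_divergence(softmax(a), softmax(b)) for a, b in zip(q_ref, q_other)]
    return sum(per_state) / len(per_state)


def fusion_weights(q_tables, temperature=1.0):
    """Assumed rule: weight is proportional to exp(-JSD to table 0 / temperature)."""
    ref = q_tables[0]
    scores = [math.exp(-table_divergence(ref, q) / temperature) for q in q_tables]
    total = sum(scores)
    return [s / total for s in scores]


def fuse(q_tables, weights):
    """Weighted average of Q-tables, entry by entry."""
    n_states, n_actions = len(q_tables[0]), len(q_tables[0][0])
    return [
        [sum(w * q[s][a] for w, q in zip(weights, q_tables)) for a in range(n_actions)]
        for s in range(n_states)
    ]


def greedy_policy(q_table):
    """Index of the highest-valued action in each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_table]


# Toy Q-tables (2 states x 2 actions), standing in for three related environments.
q_a = [[1.0, 0.0], [0.0, 2.0]]   # reference environment
q_b = [[0.8, 0.1], [0.2, 1.5]]   # close to the reference
q_c = [[0.0, 1.0], [2.0, 0.0]]   # disagrees with the reference

weights = fusion_weights([q_a, q_b, q_c])
fused = fuse([q_a, q_b, q_c], weights)
policy = greedy_policy(fused)
```

In this toy setup the table that disagrees with the reference receives the smallest weight, so the fused greedy policy follows the two tables that agree. A smaller `temperature` sharpens that preference, while a larger one moves the weights toward a plain average.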

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

Coverage Analysis of Multi-Environment Q-Learning Algorithms for Wireless Network Optimization

Q-learning Solution for Optimal Consensus Control of Discrete-Time Multiagent Systems Using Reinforcement Learning

Scalable spectral representations for multi-agent reinforcement learning in network MDPs

Dueling Network Architecture for Multi-Agent Deep Deterministic Policy Gradient

A Robust Policy Bootstrapping Algorithm for Multi-objective Reinforcement Learning in Non-stationary Environments

Decentralised Q-Learning for Multi-Agent Markov Decision Processes with a Satisfiability Criterion

A Multi-Agent Multi-Environment Mixed Q-Learning for Partially Decentralized Wireless Network Optimization

Model-Ensemble Trust-Region Policy Optimization

Phasic Parallel-Network Policy: a Deep Reinforcement Learning Framework Based on Action Correlation

Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm

Leveraging Digital Cousins for Ensemble Q-Learning in Large-Scale Wireless Networks

Coverage Analysis for Digital Cousin Selection -- Improving Multi-Environment Q-Learning

Online Reinforcement Learning for Real-Time Exploration in Continuous State and Action Markov Decision Processes

Scalable Model-based Policy Optimization for Decentralized Networked Systems

$QD$-Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning Through Consensus + Innovations

Intrinsically Motivated Hierarchical Policy Learning in Multi-objective Markov Decision Processes

Self-Play Ensemble Q-learning enabled Resource Allocation for Network Slicing