Multi-Timescale Ensemble Q-Learning for Markov Decision Process Policy Optimization

Talha Bozkus, Urbashi Mitra
DOI: https://doi.org/10.1109/tsp.2024.3372699
IF: 4.875
2024-03-22
IEEE Transactions on Signal Processing
Abstract: Reinforcement learning (RL) is a classical tool to solve network control or policy optimization problems in unknown environments. The original Q-learning suffers from performance and complexity challenges across very large networks. Herein, a novel model-free ensemble reinforcement learning algorithm that adapts classical Q-learning is proposed to handle these challenges for networks which admit Markov decision process (MDP) models. Multiple Q-learning algorithms are run in parallel on multiple, distinct, synthetically created and structurally related Markovian environments; the outputs are fused using an adaptive weighting mechanism based on the Jensen-Shannon divergence (JSD) to obtain an approximately optimal policy with low complexity. The theoretical justification of the algorithm, including the convergence of key statistics and Q-functions, is provided. Numerical results across several network models show that the proposed algorithm can achieve up to 55% lower average policy error with up to 50% lower runtime complexity than state-of-the-art Q-learning algorithms. The numerical results also validate the assumptions made in the theoretical analysis.
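The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual weighting rule, so everything below is an assumption made for illustration: each Q-table is turned into a distribution with a softmax, its JSD to a reference table is computed, and tables closer to the reference receive larger weights. The names `jsd`, `softmax`, `fuse_q_tables`, and `reference_index` are hypothetical and do not come from the paper.

```python
import numpy as np


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute zero; b > 0 wherever a > 0 because b is the mixture.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - np.max(x))
    return e / e.sum()


def fuse_q_tables(q_tables, reference_index=0):
    """Fuse Q-tables with weights that shrink as JSD to a reference table grows.

    ASSUMPTION: this weighting rule is illustrative only, not the paper's rule.
    """
    dists = [softmax(q.ravel()) for q in q_tables]
    ref = dists[reference_index]
    divergences = np.array([jsd(ref, d) for d in dists])
    weights = softmax(-divergences)  # smaller divergence -> larger weight
    fused = sum(w * q for w, q in zip(weights, q_tables))
    return fused, weights
```

Under these assumptions, a greedy policy would be read off the fused table with `fused.argmax(axis=1)`, one action per state.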
engineering, electrical & electronic