A New Baseline of Policy Gradient for Traveling Salesman Problem

Ming Gu, Huan Yang
DOI: https://doi.org/10.1109/DSAA54385.2022.10032428
2022-10-13
Abstract: Combinatorial optimization problems (COPs) such as the Traveling Salesman Problem (TSP) arise in many industries, including manufacturing, transportation, logistics, and express delivery. Deep reinforcement learning is a recent approach to solving the TSP. Policy gradient methods are efficient and effective for the TSP, and they typically rely on a critic baseline or a rollout baseline. However, these baselines increase training time and memory usage, which lowers training efficiency. This paper therefore proposes a new baseline, Random Baseline, which randomly selects floating-point numbers from a range to serve as the baseline. Taking a set of node coordinates as input, we train a Long Short-Term Memory (LSTM) network to predict a distribution over city permutations, use the negative tour length as the reward, and optimize the LSTM parameters with a policy gradient that uses Random Baseline. Extensive experiments against the existing baselines show that training time is reduced by 16% on average on the Euclidean 2D TSP20, TSP50, and TSP100 tasks, trained on 1,200,000 instances.
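The training signal described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the uniform sampling distribution, and the range bounds `low` and `high` are assumptions made for illustration, since the abstract only states that the baseline is a random float drawn from a range. The LSTM policy is omitted; `log_probs` stands in for the log-probabilities it would assign to the sampled tours.

```python
import numpy as np


def tour_length(coords, perm):
    """Length of the closed tour that visits `coords` in the order `perm`."""
    ordered = coords[perm]
    # Vector from each city to the next one, wrapping back to the start.
    diffs = ordered - np.roll(ordered, -1, axis=0)
    return float(np.linalg.norm(diffs, axis=1).sum())


def random_baseline(rng, low, high, size):
    """Draw one baseline value per sampled tour (assumed uniform on [low, high))."""
    return rng.uniform(low, high, size=size)


def reinforce_loss(log_probs, rewards, baselines):
    """Surrogate loss whose gradient is the policy-gradient estimate.

    Minimizing -mean((R - b) * log p) follows the REINFORCE gradient
    mean((R - b) * grad log p) when b does not depend on the sampled tour.
    """
    advantages = rewards - baselines
    return float(-(advantages * log_probs).mean())
```

For a batch of sampled tours, the reward would be `-tour_length(coords, perm)` for each tour, and the baselines would come from `random_baseline` with a range chosen to match the scale of the rewards. Because the baseline is drawn independently of the sampled tour, subtracting it leaves the gradient estimate unbiased, and no critic network or rollout policy has to be evaluated.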
Computer Science