Paulo R. de O. da Costa, Jason Rhuggenaath, Yingqian Zhang, Alp Akcay
Abstract: Recent works using deep learning to solve the Traveling Salesman Problem (TSP) have focused on learning construction heuristics. Such approaches find TSP solutions of good quality but require additional procedures such as beam search and sampling to improve solutions and achieve state-of-the-art performance. However, few studies have focused on improvement heuristics, where a given solution is improved until reaching a near-optimal one. In this work, we propose to learn a local search heuristic based on 2-opt operators via deep reinforcement learning. We propose a policy gradient algorithm to learn a stochastic policy that selects 2-opt operations given a current solution. Moreover, we introduce a policy neural network that leverages a pointing attention mechanism, which, unlike previous works, can be easily extended to more general k-opt moves. Our results show that the learned policies can improve even over random initial solutions and approach near-optimal solutions at a faster rate than previous state-of-the-art deep learning methods.
What problem does this paper attempt to address?
This paper addresses how to learn a 2-opt heuristic for the Traveling Salesman Problem (TSP) with deep reinforcement learning. Specifically, it develops a local search heuristic based on 2-opt operations and learns a stochastic policy that, given the current solution, selects the 2-opt operation to apply next. Unlike previous research, which mainly focuses on construction heuristics, this paper focuses on improvement heuristics: starting from an existing solution, the method applies local moves step by step until it reaches a near-optimal solution.
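For readers unfamiliar with 2-opt, a single move removes two edges of the tour and reconnects it by reversing the segment between them. The snippet below is a minimal sketch of that operation on a tour stored as a list of node indices; the function name `two_opt_move` and the index convention are illustrative assumptions, not taken from the paper.

```python
def two_opt_move(tour, i, j):
    """Return a new tour with the segment tour[i..j] reversed (one 2-opt move).

    The name and index convention are hypothetical, chosen for illustration.
    """
    if not 0 <= i < j < len(tour):
        raise ValueError("expected 0 <= i < j < len(tour)")
    # Reversing the slice replaces edges (i-1, i) and (j, j+1)
    # with edges (i-1, j) and (i, j+1).
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]


tour = [0, 1, 2, 3, 4, 5]
print(two_opt_move(tour, 1, 4))  # [0, 4, 3, 2, 1, 5]
```

The function returns a copy rather than modifying the tour in place, which makes it easy to compare the tour before and after a move.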
### Main contributions of the paper:
1. **A deep reinforcement learning algorithm based on 2-opt operations**: The algorithm uses a policy gradient method to learn a stochastic policy that, given the current solution, selects a 2-opt operation likely to improve it.
2. **A pointing attention mechanism**: The policy network uses this mechanism to select 2-opt operations, and it can be easily extended to more general k-opt operations.
3. **Experimental validation**: The learned policies improve even random initial solutions, approach near-optimal solutions quickly, and outperform previous deep learning methods in solution quality and sample efficiency.
### Specific problem description:
- **TSP**: Given a set of nodes (cities), find the shortest tour that visits each node exactly once and returns to the starting node.
- **NP-hardness**: TSP is NP-hard even in Euclidean space, so traditional methods struggle to solve large-scale instances efficiently.
- **Limitations of existing methods**: Most existing deep learning methods focus on construction heuristics. Although these can generate high-quality solutions, they usually need additional procedures such as beam search and sampling to improve them further. Research on learned improvement heuristics is comparatively scarce.
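The objective described above can be made concrete with a short sketch. It computes the length of a closed tour over 2D Euclidean coordinates; the function name `tour_length` and the toy coordinates are illustrative assumptions rather than anything specified in the paper.

```python
import math


def tour_length(coords, tour):
    """Length of the closed tour, including the edge back to the start.

    `coords` is a list of (x, y) points; `tour` is a permutation of indices.
    """
    total = 0.0
    n = len(tour)
    for k in range(n):
        x1, y1 = coords[tour[k]]
        x2, y2 = coords[tour[(k + 1) % n]]  # wrap around to close the tour
        total += math.hypot(x2 - x1, y2 - y1)
    return total


# Four corners of a unit square (toy data).
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(tour_length(coords, [0, 1, 2, 3]))  # 4.0, the perimeter
print(tour_length(coords, [0, 2, 1, 3]))  # longer: this tour crosses itself
```

The second tour uses both diagonals of the square, which is the kind of crossing an improvement heuristic tries to remove.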
### Solutions:
- **Deep reinforcement learning framework**: TSP is modeled as a Markov decision process (MDP) in which the state consists of the current solution and the best solution found so far, each action is a 2-opt operation, and the reward is based on how much the solution improves.
- **Neural network architecture**: An encoder-decoder structure is used, where the encoder extracts graph and sequence information, and the decoder outputs actions and estimates state values.
- **Policy gradient optimization**: The policy and value network parameters are trained with a policy gradient method so that the learned policy maximizes the expected cumulative reward.
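The MDP described above can be sketched as a single environment step. This is a minimal illustration under stated assumptions: the reward used here (improvement over the best tour found so far, clipped at zero) is one plausible reading of "based on how much the solution improves" and may not match the paper's exact formula, and `random_policy` is a hypothetical stand-in for the learned policy network. All function names are illustrative.

```python
import math
import random


def tour_length(coords, tour):
    """Length of the closed tour over 2D Euclidean coordinates."""
    n = len(tour)
    return sum(
        math.hypot(coords[tour[(k + 1) % n]][0] - coords[tour[k]][0],
                   coords[tour[(k + 1) % n]][1] - coords[tour[k]][1])
        for k in range(n)
    )


def two_opt_move(tour, i, j):
    """Reverse the segment tour[i..j]; returns a new list."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]


def step(coords, state, action):
    """One MDP transition. State = (current tour, best tour so far)."""
    current, best = state
    i, j = action
    nxt = two_opt_move(current, i, j)
    best_len = tour_length(coords, best)
    nxt_len = tour_length(coords, nxt)
    # Assumed reward: positive only when the new tour beats the best so far.
    reward = max(best_len - nxt_len, 0.0)
    new_best = nxt if nxt_len < best_len else best
    return (nxt, new_best), reward


def random_policy(n, rng):
    """Hypothetical placeholder for the learned policy: a uniform random move."""
    i, j = sorted(rng.sample(range(n), 2))
    return i, j


rng = random.Random(0)
coords = [(rng.random(), rng.random()) for _ in range(10)]
tour = list(range(10))
state = (tour, tour)
total_reward = 0.0
for _ in range(200):
    state, r = step(coords, state, random_policy(len(tour), rng))
    total_reward += r

print(tour_length(coords, tour), tour_length(coords, state[1]), total_reward)
```

With this reward, the rewards collected over an episode sum to the total reduction in best tour length, so maximizing cumulative reward corresponds to minimizing the length of the best tour found.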
### Experimental results:
- **Performance comparison**: The proposed algorithm reaches near-optimal solutions on TSP instances of different sizes, even when starting from a poor initial solution.
- **Sample efficiency**: Compared with other deep learning methods, the method achieves similar or better performance with fewer samples.
In conclusion, this paper uses deep reinforcement learning to learn a 2-opt improvement heuristic for TSP, offering an alternative to learned construction heuristics for obtaining near-optimal solutions.