Abstract: Scheduling is a fundamental task occurring in various automated systems applications, e.g., optimal schedules for machines on a job shop allow for a reduction of production costs and waste. Nevertheless, finding such schedules is often intractable and cannot be achieved by Combinatorial Optimization Problem (COP) methods within a given time limit. Recent advances of Deep Reinforcement Learning (DRL) in learning complex behavior enable new COP application possibilities. This paper presents an efficient DRL environment for Job-Shop Scheduling -- an important problem in the field. Furthermore, we design a meaningful and compact state representation as well as a novel, simple dense reward function, closely related to the sparse make-span minimization criteria used by COP methods. We demonstrate that our approach significantly outperforms existing DRL methods on classic benchmark instances, coming close to state-of-the-art COP approaches.
What problem does this paper attempt to address?
This paper addresses the Job-Shop Scheduling (JSS) problem, a classic combinatorial optimization problem: given a set of jobs and machines, find a schedule that minimizes the make-span, i.e., the maximum completion time over all jobs. Because of the complexity of JSS, traditional combinatorial optimization methods (such as linear programming or constraint programming) often fail to find the optimal solution within a given time limit. The paper therefore proposes a method based on Deep Reinforcement Learning (DRL) to solve it.
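To make the make-span objective concrete, the following is a minimal sketch, not the paper's code. It assumes each job is an ordered list of `(machine, processing_time)` operations and that a schedule is built by dispatching the next pending operation of each job in a given order. The function name `makespan` and the two-job instance are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Assumption: a job is an ordered list of (machine index, processing time) operations.
Job = List[Tuple[int, int]]


def makespan(jobs: List[Job], dispatch_order: List[int]) -> int:
    """Dispatch the next operation of each listed job in turn and return the make-span."""
    n_machines = 1 + max(m for job in jobs for m, _ in job)
    machine_free = [0] * n_machines   # time at which each machine becomes free
    job_free = [0] * len(jobs)        # time at which each job's last scheduled operation ends
    next_op = [0] * len(jobs)         # index of each job's next unscheduled operation

    for j in dispatch_order:
        machine, duration = jobs[j][next_op[j]]
        # An operation starts once both its machine and its job's previous operation are done.
        start = max(machine_free[machine], job_free[j])
        end = start + duration
        machine_free[machine] = end
        job_free[j] = end
        next_op[j] += 1

    # The make-span is the maximum completion time over all jobs.
    return max(job_free)


# Hypothetical instance: two jobs, two machines.
example_jobs = [
    [(0, 3), (1, 2)],  # job 0: machine 0 for 3 time units, then machine 1 for 2
    [(1, 2), (0, 4)],  # job 1: machine 1 for 2 time units, then machine 0 for 4
]

interleaved = makespan(example_jobs, [0, 1, 0, 1])  # 7
sequential = makespan(example_jobs, [0, 0, 1, 1])   # 11
```

The two dispatch orders schedule the same operations but yield different make-spans, which is the quantity the scheduler tries to minimize.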
### Main contributions of the paper
1. **Modeling and algorithm**:
   - Model the JSS problem as a single-agent reinforcement learning problem, where the agent (scheduler) selects the job to be processed at each step.
   - Use the Proximal Policy Optimization (PPO) algorithm to learn the policy and the state-value function.
2. **Environment design**:
   - Design a meaningful and compact state representation, and a novel and simple dense reward function that is closely related to the sparse make-span minimization objective used in combinatorial optimization methods.
3. **Experimental results**:
   - Experiments on classic benchmark instances show that the proposed DRL method significantly outperforms existing DRL methods and comes close to state-of-the-art COP approaches.
### Background knowledge
- **Markov Decision Process (MDP)**: A quadruple \(M=(S, A, P_a, R_a)\), where \(S\) is the state space, \(A\) is the action space, \(P_a(s, s')\) is the probability of moving from state \(s\) to state \(s'\) under action \(a\), and \(R_a(s, s')\) is the reward received for that transition.
- **Reinforcement Learning (RL)**: The goal is to learn a policy \(\pi\) such that the action taken in a given state can maximize the expected cumulative reward.
- **PPO algorithm**: Estimates the policy gradient and updates the policy parameters by Stochastic Gradient Ascent (SGA), while clipping the objective function to avoid performance collapse.
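The clipping mentioned for PPO can be illustrated with the standard per-sample clipped surrogate term, \(\min(r\hat{A},\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)\hat{A})\), where \(r\) is the probability ratio between the new and old policy and \(\hat{A}\) is the advantage estimate. The sketch below is a plain-Python illustration under assumptions: the function name is hypothetical, and the default \(\epsilon = 0.2\) is a commonly used value, not one taken from this paper.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate term (to be maximized)."""
    # Keep the probability ratio within [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes the incentive to move the policy far from the old one.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio above \(1+\epsilon\) earns no extra objective value. With a negative advantage, a ratio below \(1-\epsilon\) is likewise capped. This is how clipping bounds the size of each policy update.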
### Related work
- **Multi-agent methods**: Some studies treat each machine as an agent and use multi-agent systems to solve the JSS problem. For example, Waschneck et al. use DQN to train multiple agents, while Liu et al. use Deep Deterministic Policy Gradient (DDPG) to train agents.
- **Single-agent methods**: Zhang et al. use a Graph Neural Network (GNN) and a Multi-Layer Perceptron (MLP) to design state embeddings and action probability distributions. Han and Yang encode the state as an image and use a Convolutional Neural Network (CNN) to approximate the state-action value function.
### Method details
- **Environment design**:
  - The action space includes all jobs plus a "No-Op" (no operation) action, which skips the current time step.
  - The state is represented as a \(J\times7\) matrix containing 7 attributes for each job, such as whether it is assignable, the remaining operation time, and the percentage of completed operations.
  - The reward function is based on the difference in scheduling areas and aims to minimize machine idle time and maximize machine utilization.
- **Training process**:
  - Use the PPO algorithm for training and set a series of hyper-parameters, such as the number of network update rounds, the clipping parameter, and the coefficients of the policy loss and the value function.
  - To improve efficiency, use Weights & Biases (WandB) for hyper-parameter search and logging.
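The action space and dense reward described above can be sketched as follows. This is one plausible reading under stated assumptions, not the paper's implementation. It assumes the legal-action mask has one entry per job plus a final No-Op entry, and that the reward for a step is the processing time just scheduled minus the idle time that step creates across machines. The names `legal_action_mask`, `dense_reward`, and `allow_no_op` are hypothetical.

```python
from typing import List


def legal_action_mask(job_assignable: List[bool], allow_no_op: bool = True) -> List[bool]:
    """Return a mask of length J + 1: one entry per job, then the No-Op action."""
    # Assumption: whether No-Op is legal is decided elsewhere and passed in as a flag.
    return list(job_assignable) + [allow_no_op]


def dense_reward(scheduled_duration: int, idle_time_added: List[int]) -> int:
    """Scheduled processing time minus the idle time created on each machine by this step."""
    # Assumption: this approximates the "difference in scheduling areas" in the text;
    # filling machine time is rewarded and creating idle gaps is penalized.
    return scheduled_duration - sum(idle_time_added)


# Three jobs, the second one not assignable; the last entry is the No-Op action.
mask = legal_action_mask([True, False, True])       # [True, False, True, True]

# Scheduling a 5-unit operation that leaves a 2-unit gap on one machine.
step_reward = dense_reward(5, [0, 2, 0])            # 3

# A No-Op schedules nothing and lets two machines idle for one unit each.
no_op_reward = dense_reward(0, [1, 1])              # -2
```

Under this reading, the reward is dense because every step is scored, and keeping cumulative idle time low ties it to the sparse make-span objective.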
### Experimental results
- **Benchmark instances**:
  - Select the classic benchmark instances provided by Taillard, especially the instances with 30 jobs and 20 machines, which are considered more difficult.
  - Also select another set of instances provided by Demirkol et al. to verify the generalization ability of the method.
### Conclusion