Abstract: Operation of Autonomous Mobile Robots (AMRs) of all forms that include wheeled ground vehicles, quadrupeds and humanoids in dynamically changing GPS denied environments without a-priori maps, exclusively using onboard sensors, is an unsolved problem that has potential to transform the economy, and vastly improve humanity's capabilities with improvements to agriculture, manufacturing, disaster response, military and space exploration. Conventional AMR automation approaches are modularized into perception, motion planning and control which is computationally inefficient, and requires explicit feature extraction and engineering, that inhibits generalization, and deployment at scale. Few works have focused on real-world end-to-end approaches that directly map sensor inputs to control outputs due to the large amount of well curated training data required for supervised Deep Learning (DL) which is time consuming and labor intensive to collect and label, and sample inefficiency and challenges to bridging the simulation to reality gap using Deep Reinforcement Learning (DRL). This paper presents a novel method to efficiently train DRL for robust end-to-end AMR exploration, in a constrained environment at physical limits in simulation, transferred zero-shot to the real-world. The representation learned in a compact parameter space with 2 fully connected layers with 64 nodes each is demonstrated to exhibit emergent behavior for out-of-distribution generalization to navigation in new environments that include unstructured terrain without maps, and dynamic obstacle avoidance. The learned policy outperforms conventional navigation algorithms while consuming a fraction of the computation resources, enabling execution on a range of AMR forms with varying embedded computer payloads.
What problem does this paper attempt to address?
This paper addresses efficient exploration by Autonomous Mobile Robots (AMRs) that use only onboard sensors in dynamically changing, GPS-denied environments without pre-existing maps. Specifically, it asks how an AMR can navigate end-to-end in new and unknown environments, including unstructured terrain and scenes with dynamic obstacles, using a Deep Reinforcement Learning (DRL) policy that is trained in simulation and transferred zero-shot to the real world.
### Background and Problem Description of the Paper
**Background**:
- Autonomous Mobile Robots (AMRs) have broad application potential in fields such as agriculture, manufacturing, disaster response, military, and space exploration.
- Conventional AMR automation is usually split into three modules: perception, motion planning, and control. This modular design is computationally inefficient and requires explicit feature extraction and engineering, which limits generalization and large-scale deployment.
- Some studies have used end-to-end methods that map sensor inputs directly to control outputs, but supervised approaches need large amounts of well-curated, labeled training data, and DRL approaches suffer from sample inefficiency and the difficulty of transferring from simulation to the real world.
**Problem**:
- How can AMRs explore efficiently in dynamically changing environments without pre-existing maps?
- How can end-to-end navigation be learned with Deep Reinforcement Learning (DRL) without a large amount of labeled data?
- How can the simulation-to-reality gap be overcome so that a policy trained in simulation transfers zero-shot to the real world?
### Main Contributions of the Paper
1. **First demonstration of emergent behavior in DRL navigation agents**: A policy trained rigorously in a constrained environment generalizes to scenarios outside the training distribution.
2. **A new zero-shot policy transfer technique from simulation to reality**: High-fidelity sensor and actuator models are combined with the reward function to account for the physical differences between the simulator and the real world.
3. **Insights into observation space design and nonlinear learning curves**: The paper characterizes on-policy DRL training in continuous observation and action spaces.
### Method Overview
- **Observation Space**: Depth observations come from LiDAR or from RGB-D camera point clouds and are reduced to a 1D observation space of 170 depth measurements.
- **Action Space**: Actions are continuous and normalized to the range [-1, 1]; they consist of throttle and steering angle.
- **Reward Function**: Shaped and sparse rewards were both tested, and sparse rewards were selected. The reward mainly rewards high throttle output and penalizes a negative product of consecutive steering angles, which reduces the impact of physical differences between the simulator and the real world.
- **Training and Evaluation**: Training was carried out on a hand-crafted multi-directional track, using an Intel Core i9 13900KF CPU and an NVIDIA GeForce RTX 4090 GPU, for a total of 20,000,000 steps, corresponding to 15,747 training cycles. Zero-shot transfer was then tested in a range of real-world scenarios: tracks with different layouts, a laboratory with obstacles, parking lot exploration, unstructured forest terrain, and dynamic obstacle avoidance.
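The observation, action, and reward definitions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the network size (two fully connected layers of 64 nodes) and the 170-value observation come from the text, while the `tanh` activations, the sensor range `MAX_RANGE_M`, the uniform subsampling, the reward weights `w_throttle` and `w_flip`, and all function names are hypothetical. The steering penalty is read here as "penalize a sign flip between consecutive steering commands", which is one possible interpretation of the summary.

```python
import math
import random

N_BEAMS = 170        # size of the 1D depth observation (from the summary)
MAX_RANGE_M = 10.0   # assumed sensor range used for normalization (hypothetical)


def make_observation(scan, n_beams=N_BEAMS, max_range=MAX_RANGE_M):
    """Subsample a raw depth scan to n_beams values scaled to [0, 1]."""
    if len(scan) < n_beams:
        raise ValueError("scan has fewer readings than n_beams")
    step = len(scan) / n_beams
    picked = [scan[int(i * step)] for i in range(n_beams)]
    return [min(max(d, 0.0), max_range) / max_range for d in picked]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_tanh(x, layer):
    """Fully connected layer followed by tanh (activation is an assumption)."""
    weights, biases = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


class TinyPolicy:
    """170 depth values -> 64 -> 64 -> (throttle, steering), each in [-1, 1]."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        sizes = [N_BEAMS, 64, 64, 2]
        self.layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

    def act(self, obs):
        x = obs
        for layer in self.layers:
            x = dense_tanh(x, layer)
        throttle, steering = x
        return throttle, steering


def step_reward(throttle, steering, prev_steering, w_throttle=1.0, w_flip=0.5):
    """Reward throttle; penalize a sign flip between consecutive steering commands."""
    product = steering * prev_steering
    flip_penalty = -product if product < 0 else 0.0
    return w_throttle * throttle - w_flip * flip_penalty
```

With untrained random weights the policy output is meaningless; the sketch only shows how small the parameter space is and how the observation, action, and reward pieces fit together.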
### Results and Analysis
- **Learning Curve**: The learning curve is nonlinear. After multiple trials and adjustments, the policy reached the time-optimal trajectory after 15,747 training cycles.
- **Performance in the Training Environment**: On the multi-directional track, the DRL policy finds the time-optimal trajectory and negotiates bends of different directions and radii, comparable to a professional racing driver.
- **Transfer Performance in the Real World**: Transferred zero-shot, the policy generalizes to new track layouts, a laboratory with obstacles, parking lots, forest terrain, and dynamic obstacle avoidance.
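As a rough sanity check on the training figures quoted in this summary (20,000,000 steps over 15,747 training cycles), the average number of steps per cycle works out as follows. Whether a "training cycle" corresponds to one episode is an assumption; the summary does not say.

```latex
\[
\frac{20{,}000{,}000 \ \text{steps}}{15{,}747 \ \text{cycles}} \approx 1{,}270 \ \text{steps per cycle}
\]
```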
### Conclusion
This paper addresses efficient AMR exploration in dynamically changing environments without pre-existing maps by proposing a new DRL training method. Using a high-fidelity simulator and a carefully designed reward function, it achieves zero-shot transfer from simulation to reality, which opens new possibilities for deploying AMRs widely.