Abstract: A world model creates a surrogate world to train a controller and predict safety violations by learning the internal dynamic model of systems. However, the existing world models rely solely on statistical learning of how observations change in response to actions, lacking precise quantification of how accurate the surrogate dynamics are, which poses a significant challenge in safety-critical systems. To address this challenge, we propose foundation world models that embed observations into meaningful and causally latent representations. This enables the surrogate dynamics to directly predict causal future states by leveraging a training-free large language model. In two common benchmarks, this novel model outperforms standard world models in the safety prediction task and has a performance comparable to supervised learning despite not using any data. We evaluate its performance with a more specialized and system-relevant metric by comparing estimated states instead of aggregating observation-wide error.

What problem does this paper attempt to address?

The paper aims to achieve zero-shot safety prediction for autonomous robots. Existing world models rely mainly on statistical learning of how observations change in response to actions, and they do not quantify how accurate the learned surrogate dynamics are, which is a major challenge in safety-critical systems. To address this, the paper proposes a foundation world model that embeds observations into meaningful, interpretable latent representations and predicts future states directly, without using any training data. In two common benchmarks, the model outperforms standard world models and performs comparably to supervised learning on safety prediction.

The main contributions of the paper are:

1. **Training-free world model**: By combining the interpretable embeddings of a foundation model with training-free prediction, it overcomes the distribution shift in predicted observations that affects standard world models.
2. **Segmentation-based prediction accuracy metric**: By quantifying the deviation of each object in the observation, it provides a more focused way to evaluate the accuracy of predicted dynamics.
3. **Experimental study on safety prediction**: Despite using no training data, the foundation world model outperforms standard world models in safety prediction and performs comparably to supervised learning.

### Key technical points of the solution

- **Interpretable latent representation**: A pre-trained segmentation foundation model (such as the Segment Anything Model, SAM) segments the observed image into meaningful objects, and the positions of these objects serve as the latent representation.
- **Large language model (LLM)**: Large language models (such as GPT-3.5 and Gemma) predict future latent states without any collection or labeling of training data.
- **Object-level prediction evaluation**: The centroid distance (CD) between predicted and actual objects gives a finer-grained measure of prediction error than aggregate image metrics such as MSE and SSIM.

### Experimental results

- **State prediction**: At short prediction horizons, all methods have relatively low error, and the error grows as the horizon increases. GPT-3.5 and Gemma show lower errors in most settings, especially when predicting the falling state of the inverted pendulum.
- **Safety prediction**: The standard world model performs well at short prediction horizons. As the horizon increases, the SAM-based method achieves a higher F1 score and a lower false positive rate (FPR).

Overall, the paper reports improved safety prediction for autonomous robots in safety-critical tasks without any additional training data.
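The centroid distance and the F1/FPR scores mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: masks are assumed to be binary nested lists (one mask per segmented object), the distance is assumed to be Euclidean and measured in pixels, and "unsafe" is assumed to be the positive class. The function names `centroid`, `centroid_distance`, and `f1_and_fpr` are hypothetical.

```python
from math import hypot

# Assumption: a binary mask is a nested list, 1 where the object is, 0 elsewhere.
Mask = list[list[int]]


def centroid(mask: Mask) -> tuple[float, float]:
    """Return the (row, col) centroid of the nonzero pixels in a binary mask."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no object pixels")
    n = len(pixels)
    return sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n


def centroid_distance(pred: Mask, true: Mask) -> float:
    """Euclidean distance (in pixels) between predicted and true object centroids."""
    (pr, pc), (tr, tc) = centroid(pred), centroid(true)
    return hypot(pr - tr, pc - tc)


def f1_and_fpr(pred: list[bool], true: list[bool]) -> tuple[float, float]:
    """F1 score and false positive rate; True ("unsafe") is the assumed positive class."""
    tp = sum(p and t for p, t in zip(pred, true))
    fp = sum(p and not t for p, t in zip(pred, true))
    fn = sum(not p and t for p, t in zip(pred, true))
    tn = sum(not p and not t for p, t in zip(pred, true))
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, fpr


# Toy example: the true object sits at (1, 1), the predicted one at (1, 3).
true_mask = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
pred_mask = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
print(centroid_distance(pred_mask, true_mask))  # 2.0

# Toy safety labels over four prediction steps.
print(f1_and_fpr([True, True, False, False], [True, False, False, True]))  # (0.5, 0.5)
```

Unlike MSE or SSIM computed over the whole image, the distance here depends only on where the object is, so a small object that moves far contributes a large error even though few pixels change.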

Zero-shot Safety Prediction for Autonomous Robots with Foundation World Models
