Abstract: We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.
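To make the term "causal transformer" concrete, here is a minimal sketch of causal masking in attention, written in plain Python. The function names `causal_mask` and `causal_attention_weights` are hypothetical and chosen only for illustration. This is a generic illustration of the technique, not the paper's implementation.

```python
import math
from typing import List


def causal_mask(num_tokens: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax over raw attention scores, restricted to past positions.

    Future positions (j > i) receive exactly zero weight, so the prediction
    at step i cannot depend on tokens that come after it.
    """
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # The mask is lower-triangular, so the visible entries are j = 0..i.
        visible = [scores[i][j] for j in range(n) if mask[i][j]]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - len(visible)))
    return weights


if __name__ == "__main__":
    uniform_scores = [[0.0] * 3 for _ in range(3)]
    for row in causal_attention_weights(uniform_scores):
        print(row)
```

With uniform scores, each row spreads its weight evenly over the current and earlier positions only, which is the property that makes autoregressive training and step-by-step deployment consistent.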

What problem does this paper attempt to address?

The core problem this paper addresses is: **How can a humanoid robot be controlled in the real world by predicting the next token of a sensorimotor trajectory?** Specifically, the authors cast humanoid control as a problem analogous to next-word prediction in language models, and train a causal transformer autoregressively to predict sequences of sensor observations and motor commands.

### Main Contributions

1. **Processing of Multimodal Data**: Robot data is multimodal (for example, joint encoder readings and inertial measurement unit data), so the model performs prediction in a modality-aligned way: for each input token, it predicts the next token from the same modality.
2. **Ability to Handle Incomplete Data**: The model can learn from trajectories with missing modalities (such as video without motor commands). The missing parts are replaced with learnable mask tokens, which allows learning from a wider range of data sources.
3. **Zero-Shot Transfer Ability**: The trained model can be deployed directly in unseen real-world environments, such as walking on different terrains in San Francisco, and it generalizes to commands that did not appear in the training set, such as walking backward.

### Method Overview

- **Data Sources**: A dataset of sensorimotor trajectories collected from multiple sources: trajectories generated by neural network policies, trajectories generated by model-based controllers, human motion capture data, and human motion in YouTube videos.
- **Model Architecture**: A standard transformer. The time-series data is tokenized, and a causal mask ensures that the model attends only to past information when making a prediction.
- **Training Objectives**: Minimize the negative log-likelihood, that is, maximize the predicted probability of the next token. In addition, mean squared error (MSE) is used as an auxiliary loss for the regression task.

### Experimental Verification

- **Real-World Deployment**: The model runs on the physical robot and walks stably in complex outdoor environments.
- **Quantitative Evaluation**: The proposed method is compared with existing methods on two metrics, trajectory tracking error and prediction error, to assess its performance and generalization ability.

In conclusion, this paper proposes a method for training a humanoid robot controller on large-scale sensorimotor trajectory data and demonstrates its potential in real-world applications.
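The modality-aligned prediction and the handling of missing modalities described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code. It assumes tokens are interleaved as observation, action, observation, action; that a missing action is represented by `None` (standing in for the learnable mask token); and that the loss skips any prediction whose target is missing. The helper names `interleave`, `aligned_pairs`, and `masked_mse` are hypothetical.

```python
from typing import List, Optional, Tuple

Vector = List[float]
# (modality name, value); a value of None marks a missing modality.
Token = Tuple[str, Optional[Vector]]


def interleave(observations: List[Vector],
               actions: List[Optional[Vector]]) -> List[Token]:
    """Build the sequence o_1, a_1, o_2, a_2, ... from per-step data."""
    tokens: List[Token] = []
    for obs, act in zip(observations, actions):
        tokens.append(("obs", obs))
        tokens.append(("act", act))
    return tokens


def aligned_pairs(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Pair each position with the next token of the same modality.

    Pairs whose target is missing are dropped (an assumption of this sketch),
    so trajectories without actions still contribute observation targets.
    """
    pairs = []
    for i, (modality, _) in enumerate(tokens):
        for j in range(i + 1, len(tokens)):
            if tokens[j][0] == modality:
                if tokens[j][1] is not None:
                    pairs.append((i, j))
                break  # only the nearest same-modality token is the target
    return pairs


def masked_mse(predictions: List[Vector], tokens: List[Token]) -> float:
    """Mean squared error over all valid (prediction, target) pairs."""
    total, count = 0.0, 0
    for i, j in aligned_pairs(tokens):
        target = tokens[j][1]
        for p, y in zip(predictions[i], target):
            total += (p - y) ** 2
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Three steps; the action at step 2 is missing, as in video-only data.
    tokens = interleave([[0.0], [1.0], [2.0]], [[0.5], None, [0.7]])
    print(aligned_pairs(tokens))
```

In this toy sequence the observation at step 1 is paired with the observation at step 2, while the pair whose target is the missing action is skipped, so incomplete trajectories can be trained on without inventing action labels.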

Humanoid Locomotion as Next Token Prediction