Abstract: Learned visual dynamics models have proven effective for robotic manipulation tasks. Yet, it remains unclear how best to represent scenes involving multi-object interactions. Current methods decompose a scene into discrete objects, but they struggle with precise modeling and manipulation amid challenging lighting conditions as they only encode appearance tied with specific illuminations. In this work, we propose using object-centric neural scattering functions (OSFs) as object representations in a model-predictive control framework. OSFs model per-object light transport, enabling compositional scene re-rendering under object rearrangement and varying lighting conditions. By combining this approach with inverse parameter estimation and graph-based neural dynamics models, we demonstrate improved model-predictive control performance and generalization in compositional multi-object environments, even in previously unseen scenarios and harsh lighting conditions.

What problem does this paper attempt to address?

The paper addresses how to perform accurate visual modeling and manipulation under complex, changing lighting in scenes where multiple objects interact. Existing methods typically decompose the scene into discrete objects, but they encode appearance tied to a specific illumination, so they perform poorly when lighting changes, especially under extremely harsh lighting. The paper instead represents each object with an object-centric neural scattering function (OSF) and combines these representations with graph neural networks (GNNs) that predict dynamics in multi-object environments, yielding more accurate model-predictive control (MPC).

### Main Contributions

1. **Inverse Parameter Estimation**: Using object-centric neural scattering functions (OSFs), the method estimates object pose and lighting direction under challenging and previously unseen lighting conditions.
2. **Long-term Prediction**: The method models the compositional structure of the scene and supports long-term prediction of future system states, which in turn supports downstream planning tasks.
3. **Manipulation under Extreme Lighting**: Experiments show that the method can successfully perform manipulation tasks in simulated multi-object scenes with extreme lighting directions.

### Method Overview

1. **Object-Centric Neural Scattering Functions (OSFs)**:
   - OSFs explicitly model the light transport of each object and predict the object's radiative transfer as a function of spatial position, incident light direction, and outgoing light direction.
   - KiloOSFs are used to accelerate rendering. KiloOSFs extend NeRF and can handle complex light transport and shadow effects.
2. **Inverse Parameter Estimation**:
   - Covariance matrix adaptation (CMA) is used to optimize the 6D pose of each object and the lighting position.
   - The object poses and lighting parameters are optimized by minimizing the mean-squared error (MSE) between multi-view rendered images and observed images.
3. **Action-conditioned Dynamics Model**:
   - A graph neural network (GNN) dynamics model is trained. Its input is the current object poses and the action, and its output is the future 6D pose of each object.
   - The dynamics model predicts future states through multiple inter-object propagation steps to handle multi-object interactions.
4. **Visual Model-Predictive Control**:
   - Given a target image and an initial visual observation, the robot action sequence is optimized through sampling and forward prediction to reach the target.
   - MPPI updates the action sampling distribution, and the first step of the optimal action sequence is executed in the environment.
   - The object pose estimates are then updated through inverse parameter estimation, and the process repeats for replanning.

### Experimental Results

- **Visual Reconstruction**: KiloOSFs render plausible color changes and shadows of objects under extreme lighting conditions, outperforming existing compositional NeRFs.
- **Visual Prediction**: Compared with the FitVid model, which predicts directly in pixel space, the combination of the GNN dynamics model and the KiloOSF rendering module is more accurate over long prediction horizons.
- **Model-Predictive Control**: Under random lighting conditions and unseen object configurations, the method performs better in model-predictive control tasks, especially in multi-object interaction scenarios.
- **Generalization Ability**: The method naturally handles different numbers of objects (such as 2 or 4 objects) without retraining the model.
- **Real-World Application**: Under real-world extreme lighting conditions, the method can successfully estimate the lighting and object pose.

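
The inverse parameter estimation step in the method overview (searching over pose and lighting to minimize the MSE between rendered and observed images) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `render` is a toy function rather than an OSF renderer, the parameters are reduced to a 2D position plus one light angle, `VIEWS` is a hypothetical set of camera azimuths, and the optimizer is a simple elite-refitting Gaussian search with a diagonal covariance, used here as a simplified stand-in for CMA rather than the paper's implementation.

```python
import math
import random

# Hypothetical camera azimuths (radians) standing in for multiple views.
VIEWS = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]


def render(params):
    """Toy renderer mapping (x, y, light_angle) to a flat list of 'pixels'.

    This is NOT an OSF. It only mimics the fact that the rendered image
    depends on both the object pose and the light direction.
    """
    x, y, light = params
    pixels = []
    for a in VIEWS:
        pixels.append(x * math.cos(a) + y * math.sin(a))  # projected position
        pixels.append(0.5 * (1.0 + math.cos(light - a)))  # smooth shading term
    return pixels


def mse(a, b):
    """Mean-squared error between two equally sized pixel lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def estimate(observed, iters=40, pop=64, n_elite=8, seed=0):
    """Sampling-based search over (x, y, light_angle) minimizing render MSE.

    Simplified stand-in for CMA: only a per-dimension mean and standard
    deviation are adapted, by refitting them to the lowest-loss samples.
    """
    rng = random.Random(seed)
    mean = [0.0, 0.0, 0.0]
    std = [1.0, 1.0, 1.0]
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(pop)]
        samples.sort(key=lambda p: mse(render(p), observed))
        elites = samples[:n_elite]
        for d in range(3):
            vals = [e[d] for e in elites]
            mean[d] = sum(vals) / n_elite
            var = sum((v - mean[d]) ** 2 for v in vals) / n_elite
            std[d] = max(math.sqrt(var), 1e-6)  # floor keeps sampling alive
    return mean


true_params = [0.4, -0.3, 0.8]   # ground truth used to synthesize the observation
observed = render(true_params)
estimate_params = estimate(observed)
print([round(v, 3) for v in estimate_params])
```

In this toy setting, the four views make the position and the light angle jointly identifiable, which is the reason the summary's multi-view MSE objective is mirrored here instead of a single-view loss.
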
In conclusion, the paper proposes a method that addresses visual modeling and manipulation under complex lighting conditions in multi-object interaction scenarios by combining object-centric neural scattering functions with graph neural networks.
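
The sampling-based control loop summarized under visual model-predictive control (sample action sequences, predict forward, reweight with MPPI, execute the first action, replan) can be sketched as follows. This is a hedged illustration under strong simplifying assumptions: `dynamics` is a hypothetical stand-in for the learned GNN dynamics model and just displaces one 2D object, the cost is the squared distance of the predicted final position to a goal position rather than an image-space objective, and the state is observed directly instead of being re-estimated through inverse parameter estimation. The hyperparameters `sigma` and `lam` are illustrative choices, not values from the paper.

```python
import math
import random


def dynamics(state, action):
    """Hypothetical stand-in for the learned dynamics model: the action
    displaces a single 2D object position."""
    return (state[0] + action[0], state[1] + action[1])


def rollout_cost(state, actions, goal):
    """Roll the model forward and score the final predicted state."""
    for a in actions:
        state = dynamics(state, a)
    return (state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2


def mppi_step(state, mean, goal, rng, n_samples=256, sigma=0.1, lam=0.01):
    """One MPPI update of the mean action sequence."""
    samples, costs = [], []
    for _ in range(n_samples):
        seq = [(m[0] + rng.gauss(0.0, sigma), m[1] + rng.gauss(0.0, sigma))
               for m in mean]
        samples.append(seq)
        costs.append(rollout_cost(state, seq, goal))
    best = min(costs)
    # Exponentiated negative cost; subtracting the best cost keeps the
    # largest weight at exactly 1 for numerical stability.
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    new_mean = []
    for t in range(len(mean)):
        ax = sum(w * s[t][0] for w, s in zip(weights, samples)) / total
        ay = sum(w * s[t][1] for w, s in zip(weights, samples)) / total
        new_mean.append((ax, ay))
    return new_mean


def run_mpc(start, goal, steps=40, horizon=5, seed=0):
    """Receding-horizon loop: replan, execute the first action, shift."""
    rng = random.Random(seed)
    state = start
    mean = [(0.0, 0.0)] * horizon
    for _ in range(steps):
        mean = mppi_step(state, mean, goal, rng)
        state = dynamics(state, mean[0])   # execute only the first action
        mean = mean[1:] + [(0.0, 0.0)]     # warm-start the next plan
    return state


goal = (1.0, -0.5)
final_state = run_mpc(start=(0.0, 0.0), goal=goal)
goal_distance = math.hypot(final_state[0] - goal[0], final_state[1] - goal[1])
print(round(goal_distance, 3))
```

Executing only the first action and then replanning is what lets the loop absorb prediction errors. In the pipeline described above, the replanning step would additionally refresh the object poses from new observations before the next MPPI update.
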

Multi-Object Manipulation via Object-Centric Neural Scattering Functions

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

Learning Object-Centric Neural Scattering Functions for Free-Viewpoint Relighting and Scene Composition

Learning Multi-Object Dynamics with Compositional Neural Radiance Fields

Object-Centric Neural Scene Rendering

Model predictive manipulation of compliant objects with multi-objective optimizer and adversarial network for occlusion compensation

Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering

Dynamics Learning with Object-Centric Interaction Networks for Robot Manipulation

Dynamic-Resolution Model Learning for Object Pile Manipulation

Learning Latent Object-Centric Representations for Visual-Based Robot Manipulation

Vision-Based Categorical Object Pose Estimation and Manipulation

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Latent Space Planning for Multiobject Manipulation With Environment-Aware Relational Classifiers

Latent Space Planning for Multi-Object Manipulation with Environment-Aware Relational Classifiers

KinScene: Model-Based Mobile Manipulation of Articulated Scenes

Unsupervised Discovery and Composition of Object Light Fields

Explicit Composition of Neural Radiance Fields by Learning an Occlusion Field

Unsupervised Dynamics Prediction with Object-Centric Kinematics

Compositional 3D Human-Object Neural Animation

Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views