Abstract: To achieve human-level dexterity, robots must infer spatial awareness from multimodal sensing to reason over contact interactions. During in-hand manipulation of novel objects, such spatial awareness involves estimating the object's pose and shape. The status quo for in-hand perception primarily employs vision, and restricts to tracking a priori known objects. Moreover, visual occlusion of objects in-hand is imminent during manipulation, preventing current systems from pushing beyond tasks without occlusion. We combine vision and touch sensing on a multi-fingered hand to estimate an object's pose and shape during in-hand manipulation. Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem. We study multimodal in-hand perception in simulation and the real-world, interacting with different objects via a proprioception-driven policy. Our experiments show final reconstruction F-scores of $81$% and average pose drifts of $4.7\,\text{mm}$, further reduced to $2.3\,\text{mm}$ with known CAD models. Additionally, we observe that under heavy visual occlusion we can achieve up to $94$% improvements in tracking compared to vision-only methods. Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation. We release our evaluation dataset of 70 experiments, FeelSight, as a step towards benchmarking in this domain. Our neural representation driven by multimodal sensing can serve as a perception backbone towards advancing robot dexterity. Videos can be found on our project website: https://suddhu.github.io/neural-feels/

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the problem of spatial perception for robots during in-hand manipulation. It proposes NeuralFeels, a method that estimates the pose and shape of objects by combining visual and tactile sensing. The goals of this method are:

1. **Improve spatial perception**: Robots need to infer spatial awareness from multimodal sensing in order to reason over contact interactions. When manipulating novel objects in-hand, this spatial awareness involves estimating the pose and shape of the object.
2. **Overcome the limitations of existing methods**: Current in-hand perception methods rely mainly on vision and are limited to tracking known objects. In addition, visual occlusion of the object is inevitable during in-hand manipulation, which restricts existing systems to tasks without occlusion.
3. **Achieve robust tracking and reconstruction**: By combining vision and touch, NeuralFeels learns a neural field of the object's geometry online and jointly tracks the object by optimizing a pose graph problem. Experiments in simulation and the real world report accurate pose tracking and shape reconstruction.

### Main contributions

1. **Multimodal perception fusion**: NeuralFeels combines vision, touch, and proprioception to build a continuous neural field representation for online estimation of object pose and shape.
2. **Robust object tracking**: When the object is partially or completely occluded, tactile information can significantly improve the accuracy of visual estimates, yielding more stable object tracking.
3. **Online learning and optimization**: By learning the neural field online, the method reconstructs the object's 3D shape as it is manipulated, without pre-labeling.
4. **Experimental verification**: The authors ran experiments in simulation and the real world to evaluate NeuralFeels on different objects. The method performs well in both pose tracking and shape reconstruction, especially under severe visual occlusion.

### Experimental results

- **Object reconstruction**: On novel objects, NeuralFeels reaches a final reconstruction F-score of 81% with an average pose drift of 4.7 mm. With a known CAD model, the pose drift is further reduced to 2.3 mm.
- **Pose tracking**: Under heavy visual occlusion, NeuralFeels improves tracking by up to 94% compared with vision-only methods.
- **Advantages of multimodal fusion**: Touch at the very least refines visual estimates and at best disambiguates them, supplementing information in occluded regions and improving the accuracy and robustness of overall perception.

### Conclusion

NeuralFeels achieves robust pose tracking and shape reconstruction of novel objects by combining visual and tactile sensing. The method performs well even under severe visual occlusion, and its multimodal neural representation can serve as a perception backbone for future dexterous manipulation.
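
The reconstruction F-score quoted in the results can be illustrated with a minimal sketch. It uses the common point-cloud definition (the harmonic mean of precision and recall at a distance threshold) and is not taken from the paper's evaluation code. The function names, the threshold `tau`, and the toy point sets are assumptions made for illustration only.

```python
import numpy as np


def nearest_distances(src, dst):
    """For each point in src, return the Euclidean distance to its nearest point in dst."""
    diff = src[:, None, :] - dst[None, :, :]  # shape (N, M, 3) via broadcasting
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau.

    precision: fraction of predicted points within tau of the ground truth
    recall:    fraction of ground-truth points within tau of the prediction
    """
    precision = float((nearest_distances(pred, gt) < tau).mean())
    recall = float((nearest_distances(gt, pred) < tau).mean())
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy data in meters (hypothetical, not from the FeelSight dataset).
gt = np.array([[0.00, 0.0, 0.0],
               [0.01, 0.0, 0.0],
               [0.02, 0.0, 0.0],
               [0.03, 0.0, 0.0]])
# The prediction covers half of the surface and contains one outlier.
pred = np.array([[0.00, 0.0, 0.0],
                 [0.01, 0.0, 0.0],
                 [1.00, 1.0, 1.0]])

score = f_score(pred, gt, tau=0.005)  # assumed 5 mm threshold
print(f"F-score: {score:.3f}")
```

The brute-force nearest-neighbor search is quadratic in the number of points, which is fine for a toy example. For dense reconstructions a spatial index such as a k-d tree would be the usual choice.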

Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation

NeuralFeels with neural fields: Visuotactile perception for in-hand manipulation.

Dexterous Manoeuvre Through Touch in a Cluttered Scene

Robot Synesthesia: In-Hand Manipulation with Visuotactile Sensing

See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation

Rotating without Seeing: Towards In-hand Dexterity through Touch

Proprioception and Exteroception of a Soft Robotic Finger Using Neuromorphic Vision-Based Sensing

PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation

Multifingered Robot Hand Compliant Manipulation Based on Vision-Based Demonstration and Adaptive Force Control

SmartHand: Towards Embedded Smart Hands for Prosthetic and Robotic Applications

More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch

Learning Fine Pinch-Grasp Skills using Tactile Sensing from A Few Real-world Demonstrations

Learning Visuotactile Skills with Two Multifingered Hands

EyeSight Hand: Design of a Fully-Actuated Dexterous Robot Hand with Integrated Vision-Based Tactile Sensors and Compliant Actuation

OmniTact: A Multi-Directional High Resolution Touch Sensor

MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation

Fusing Visuo-Tactile Perception into Kernelized Synergies for Robust Grasping and Fine Manipulation of Non-rigid Objects

Multifingered Grasping Based on Multimodal Reinforcement Learning

Capturing forceful interaction with deformable objects using a deep learning-powered stretchable tactile array

Imagine2touch: Predictive Tactile Sensing for Robotic Manipulation using Efficient Low-Dimensional Signals

3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing