Jiahao Nick Li, Toby Chong, Zhongyi Zhou, Hironori Yoshida, Koji Yatani, Xiang 'Anthony' Chen, Takeo Igarashi
Abstract: Object pose estimation plays a vital role in mixed-reality interactions when users manipulate tangible objects as controllers. Traditional vision-based object pose estimation methods leverage 3D reconstruction to synthesize training data. However, these methods are designed for static objects with diffuse colors and do not work well for objects that change their appearance during manipulation, such as deformable objects like plush toys, transparent objects like chemical flasks, reflective objects like metal pitchers, and articulated objects like scissors. To address this limitation, we propose RoCap, a robotic pipeline that emulates human manipulation of target objects while generating data labeled with ground truth pose information. The user first gives the target object to a robotic arm, and the system captures many pictures of the object in various 6D configurations. The system trains a model by using captured images and their ground truth pose information automatically calculated from the joint angles of the robotic arm. We showcase pose estimation for appearance-changing objects by training simple deep-learning models using the collected data and comparing the results with a model trained with synthetic data based on 3D reconstruction via quantitative and qualitative evaluation. The findings underscore the promising capabilities of RoCap.
What problem does this paper attempt to address?
The paper addresses how to accurately estimate the 6D pose (position and orientation) of appearance-changing objects when they are used as controllers in mixed-reality interactions. Traditional vision-based pose-estimation methods are designed for static objects with diffuse colors and perform poorly on objects whose appearance changes during manipulation, such as deformable plush toys, transparent chemical flasks, reflective metal pitchers, and articulated scissors. To address this, the paper proposes RoCap, an automated data-collection pipeline in which a robotic arm emulates human manipulation of the target object while images are captured. A deep-learning model is then trained on the collected data to improve pose-estimation accuracy.
### Specific problem description:
1. **Limitations of existing methods**: Existing pose-estimation methods mainly target static objects and perform poorly on objects whose appearance changes during manipulation (for example, deformable, transparent, reflective, and articulated objects).
2. **Challenges in data collection**: Manually collecting data for these objects is difficult and error-prone because the object pose must be labeled accurately in every image.
3. **Insufficiencies in model training**: Models trained on synthetic data or on a small amount of labeled data perform poorly on appearance-changing objects in real scenarios.
### Solutions:
- **RoCap system**: An automated data-collection pipeline that uses a 6-degree-of-freedom robotic arm and an RGB camera to capture images of the object in different poses, and that labels each image with its 6D pose computed from the arm's forward kinematics.
- **Data processing and augmentation**: The collected data is processed to generate object masks, and data-augmentation techniques are applied to improve the model's generalization.
- **Model training and evaluation**: A deep-learning model is trained on the collected data, and its performance on appearance-changing objects is verified through quantitative and qualitative evaluations.
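The automatic labeling step can be illustrated with a minimal sketch. It assumes, since the summary does not say so, that the arm is described by a standard Denavit-Hartenberg table, that the camera-to-base transform comes from a separate calibration, and that the object sits at a fixed offset in the gripper. All names (`dh_transform`, `forward_kinematics`, `label_object_pose`, `cam_from_base`, `ee_from_obj`) are hypothetical and not taken from the paper.

```python
import math

def identity():
    """Return a 4x4 identity transform as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def dh_transform(theta, d, a, alpha):
    """Standard Denavit-Hartenberg transform for one revolute joint."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]

def forward_kinematics(joint_angles, dh_table):
    """Chain per-joint transforms to get the base-to-end-effector pose."""
    pose = identity()
    for theta, (d, a, alpha) in zip(joint_angles, dh_table):
        pose = matmul(pose, dh_transform(theta, d, a, alpha))
    return pose

def label_object_pose(joint_angles, dh_table, cam_from_base, ee_from_obj):
    """Object pose in the camera frame: T_cam_base * T_base_ee(q) * T_ee_obj."""
    base_from_ee = forward_kinematics(joint_angles, dh_table)
    return matmul(matmul(cam_from_base, base_from_ee), ee_from_obj)
```

Each captured image would be paired with the 4x4 matrix returned by `label_object_pose` for the joint angles recorded at capture time. The rotation block and the translation column together give the 6D pose label.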
### Main contributions:
1. **Automated data-collection pipeline**: An automated method for collecting and labeling 6D pose data of appearance-changing objects, which removes the need for manual labeling.
2. **Performance verification**: A comparison with existing 3D-reconstruction-based few-shot methods (such as Gen6D) verifies that RoCap improves pose-estimation accuracy.
### Related work:
- **Object pose estimation**: Discusses applications of object pose estimation in mixed reality, robotics, and automation, as well as existing sensor-based, marker-based, and computer-vision approaches.
- **Data-collection methods**: Introduces ways of obtaining the labeled data required by data-driven deep-learning methods, including synthetic data, public datasets, and interactive data-collection tools.
### Experimental results:
- **Quantitative evaluation**: In a controlled environment, model accuracy is evaluated under varying camera angles and backgrounds. The results show that RoCap outperforms existing 3D-reconstruction-based methods on appearance-changing objects.
- **Qualitative evaluation**: In an application setting, pose estimation is run while a user manipulates the object, demonstrating that the model is feasible in practical use.
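The summary does not state which accuracy metrics the paper reports. As an illustration only, a common pair for 6D pose is the geodesic rotation error and the Euclidean translation error, sketched below with hypothetical function names.

```python
import math

def rotation_error_deg(r_pred, r_true):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(r_pred^T r_true) equals the sum of element-wise products.
    trace = sum(r_pred[i][j] * r_true[i][j]
                for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and true positions."""
    return math.dist(t_pred, t_true)
```

Averaging these two values over a held-out set of labeled images gives one number per method for rotation and one for translation, which is enough to compare a model trained on robot-collected data against a baseline.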
In conclusion, RoCap addresses the limitations of existing methods on appearance-changing objects and offers a new approach to object pose estimation in mixed-reality interactions.