Neural View Synthesis and Matching for Semi-Supervised Few-Shot Learning of 3D Pose

Angtian Wang, Shenxiao Mei, Alan Yuille, Adam Kortylewski
DOI: https://doi.org/10.48550/arXiv.2110.14213
2021-10-27
Abstract: We study the problem of learning to estimate the 3D object pose from a few labelled examples and a collection of unlabelled data. Our main contribution is a learning framework, neural view synthesis and matching, that can transfer the 3D pose annotation from the labelled to unlabelled images reliably, despite unseen 3D views and nuisance variations such as the object shape, texture, illumination or scene context. In our approach, objects are represented as 3D cuboid meshes composed of feature vectors at each mesh vertex. The model is initialized from a few labelled images and is subsequently used to synthesize feature representations of unseen 3D views. The synthesized views are matched with the feature representations of unlabelled images to generate pseudo-labels of the 3D pose. The pseudo-labelled data is, in turn, used to train the feature extractor such that the features at each mesh vertex are more invariant across varying 3D views of the object. Our model is trained in an EM-type manner alternating between increasing the 3D pose invariance of the feature extractor and annotating unlabelled data through neural view synthesis and matching. We demonstrate the effectiveness of the proposed semi-supervised learning framework for 3D pose estimation on the PASCAL3D+ and KITTI datasets. We find that our approach outperforms all baselines by a wide margin, particularly in an extreme few-shot setting where only 7 annotated images are given. Remarkably, we observe that our model also achieves an exceptional robustness in out-of-distribution scenarios that involve partial occlusion.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to estimate the 3D pose of objects when only a small amount of labeled data and a large amount of unlabeled data are available. Specifically, the authors propose a semi-supervised learning framework. Through Neural View Synthesis and Matching (NVSM), the 3D pose annotations of a small number of labeled images are transferred to unlabeled images, thereby generating pseudo-labels. This process handles unseen 3D views as well as nuisance factors such as changes in object shape, texture, illumination or scene context.

### Core contributions of the paper

1. **Proposed the NVSM method**: By synthesizing and matching neural network feature maps, the 3D pose information in the labeled images is reliably transferred to the unlabeled images, even if these unlabeled images contain unseen 3D views and various nuisance variations.
2. **Improved the 3D pose invariance of the feature extractor**: The generated pseudo-labeled data is used to train the feature extractor, making its features more invariant across different 3D views and thereby improving the accuracy of the pseudo-labels.
3. **Achieved efficient semi-supervised learning**: An EM-type iterative process gradually increases the range of 3D pose differences in the synthesized views, and finally achieves effective labeling of the unlabeled data and training of the model.

### Technical details

- **Neural view synthesis**:
  - Use a pre-trained convolutional neural network (such as ResNet50) to extract the image feature map \( F \).
  - Project a 3D cuboid mesh onto the feature map and sample the feature vector at each visible mesh vertex.
  - Rotate the 3D cuboid mesh to a new pose \( \theta' = \theta+\Delta\theta \) and generate a feature map \( F_{\theta'} \) in the new view through rasterization.
- **Spatial matching**:
  - Compute the similarity between the synthesized feature map \( F_{\theta'} \) and the feature map \( F_m \) of an unlabeled image, with the cosine distance \( d \) as the metric:
    \[ S(F_{\theta'}, F_m)=\frac{1}{HW}\sum_{h}\sum_{w}[1 - d(F_{\theta'}(h, w), F_m(h, w))] \]
  - Select the unlabeled image with the highest similarity and assign it the pseudo-label \( \theta' \).
- **Training of the feature extractor**:
  - Use a contrastive loss function to train the feature extractor, maximizing the feature similarity of the same vertex in different images while minimizing the feature similarity between different vertices:
    \[ L^{+}(F_i, F_j, \Gamma)=\sum_{r = 1}^{R}[1 - d(F_i(P_{\theta_i}\cdot x_r), F_j(P_{\theta_j}\cdot x_r))] \]
    \[ L^{-}(F_i, F_j, \Gamma)=\sum_{r = 1}^{R}\sum_{r'\neq r}d(F_i(P_{\theta_i}\cdot x_r), F_j(P_{\theta_j}\cdot x_{r'})) \]

### Experimental results

- **PASCAL3D+ dataset**: In the extreme few-shot setting (only 7 labeled images), this method significantly outperforms the other baseline methods, especially under partial occlusion.
- **KITTI dataset**: This method also performs well across the different occlusion levels. In particular, when objects are partially or mostly occluded, its accuracy and median error are better than those of the other methods.

In conclusion, this paper proposes a semi-supervised learning framework that can estimate the 3D pose of objects from a very small amount of labeled data, which gives it practical value and research significance.
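The spatial matching score and the two contrastive terms listed under Technical details can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: feature maps are assumed to be arrays of shape `(H, W, C)`, vertex features arrays of shape `(R, C)` already sampled at the projected vertices \( P_{\theta}\cdot x_r \), and the function names (`cosine_distance`, `similarity`, `best_match`, `pairwise_terms`) are hypothetical. The sketch evaluates \( S \), \( L^{+} \) and \( L^{-} \) exactly as the formulas are written, with \( d \) taken as cosine distance; how the two terms are weighted and combined into the final training objective is not specified in this summary, so the code stops at computing them.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Cosine distance 1 - cos(a, b), computed along the last axis."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - np.sum(a_n * b_n, axis=-1)


def similarity(f_syn, f_img):
    """S(F_theta', F_m): mean over the H x W grid of 1 - d."""
    return float(np.mean(1.0 - cosine_distance(f_syn, f_img)))


def best_match(f_syn, unlabelled):
    """Return the index of the unlabelled feature map with the highest S,
    together with all scores. That image would receive the pseudo-label
    theta' of the synthesized view."""
    scores = [similarity(f_syn, f) for f in unlabelled]
    return int(np.argmax(scores)), scores


def pairwise_terms(v_i, v_j):
    """Evaluate L+ and L- as written, for vertex features of shape (R, C).

    Assumption: row r of v_i and v_j holds the feature sampled at the
    projection of vertex x_r in images i and j respectively.
    """
    # d[r, r'] = cosine distance between vertex r in image i
    # and vertex r' in image j.
    d = cosine_distance(v_i[:, None, :], v_j[None, :, :])
    diag = np.diag(d)
    l_pos = float(np.sum(1.0 - diag))        # same vertex, r' == r
    l_neg = float(np.sum(d) - np.sum(diag))  # different vertices, r' != r
    return l_pos, l_neg


# Toy data (random, purely illustrative): the second unlabelled map is a
# slightly perturbed copy of the synthesized view, so it should win.
rng = np.random.default_rng(0)
f_syn = rng.normal(size=(4, 4, 8))
unlabelled = [
    rng.normal(size=(4, 4, 8)),
    f_syn + 0.01 * rng.normal(size=(4, 4, 8)),
    rng.normal(size=(4, 4, 8)),
]
match_index, match_scores = best_match(f_syn, unlabelled)
```

Averaging over the grid keeps \( S \) in the range \([-1, 1]\) regardless of the feature map resolution, which makes scores comparable across candidate views.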