Transformers in Unsupervised Structure-from-Motion

Hemang Chawla, Arnav Varma, Elahe Arani, Bahram Zonooz
DOI: https://doi.org/10.1007/978-3-031-45725-8_14
2023-12-17
Abstract: Transformers have revolutionized deep learning based computer vision with improved performance as well as robustness to natural corruptions and adversarial attacks. Transformers are used predominantly for 2D vision tasks, including image classification, semantic segmentation, and object detection. However, robots and advanced driver assistance systems also require 3D scene understanding for decision making by extracting structure-from-motion (SfM). We propose a robust transformer-based monocular SfM method that learns to predict monocular pixel-wise depth, ego vehicle's translation and rotation, as well as camera's focal length and principal point, simultaneously. With experiments on KITTI and DDAD datasets, we demonstrate how to adapt different vision transformers and compare them against contemporary CNN-based methods. Our study shows that transformer-based architecture, though lower in run-time efficiency, achieves comparable performance while being more robust against natural corruptions, as well as untargeted and targeted attacks.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to address?

This paper examines the advantages of a Transformer-based architecture over Convolutional Neural Networks (CNNs) for unsupervised monocular Structure-from-Motion (SfM). Specifically:

1. **Unsupervised Monocular Depth Estimation**: The paper proposes a Transformer-based method that predicts pixel-wise depth, ego-motion (translation and rotation), and the camera's focal length and principal point from monocular images.
2. **Performance and Robustness Comparison**: Through experiments on the KITTI and DDAD datasets, the paper compares CNN and Transformer architectures in terms of run-time efficiency, robustness to natural corruptions, and robustness to untargeted and targeted adversarial attacks.
3. **Camera Intrinsics Prediction**: A modular approach is introduced to predict the camera focal length and principal point from the input image, and this approach is applied to both CNN and Transformer architectures.

The results indicate that the Transformer-based architecture, although lower in run-time efficiency than CNNs, achieves comparable performance while being more robust to natural corruptions and adversarial attacks. Additionally, the method shows generalization capabilities across different datasets and tasks.
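
To make the role of the predicted quantities concrete, the sketch below shows how depth, ego-motion, and camera intrinsics are commonly combined in a pinhole camera model to map a pixel from one frame into another. This is an illustrative assumption about how such predictions fit together, not the paper's implementation: the function names (`intrinsics_matrix`, `backproject`, `reproject`, `warp_pixel`) and all numeric values are hypothetical, and the actual networks and losses are not described in the text above.

```python
import numpy as np


def intrinsics_matrix(fx, fy, cx, cy):
    """Build a 3x3 pinhole intrinsics matrix from focal lengths and principal point."""
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]], dtype=np.float64)


def backproject(u, v, depth, K):
    """Lift pixel (u, v) with the given depth to a 3D point in the camera frame."""
    pixel = np.array([u, v, 1.0], dtype=np.float64)
    return depth * (np.linalg.inv(K) @ pixel)


def reproject(point, R, t, K):
    """Apply the rigid motion (R, t) to a 3D point and project it back to pixels."""
    moved = R @ point + t
    uvw = K @ moved
    return uvw[:2] / uvw[2]


def warp_pixel(u, v, depth, K, R, t):
    """Map a pixel from the source frame into the frame reached by motion (R, t)."""
    return reproject(backproject(u, v, depth, K), R, t, K)


# Hypothetical values standing in for network predictions.
K = intrinsics_matrix(fx=700.0, fy=700.0, cx=600.0, cy=180.0)
R = np.eye(3)                    # no rotation
t = np.array([1.0, 0.0, 0.0])    # 1 unit of sideways translation
print(warp_pixel(600.0, 180.0, 10.0, K, R, t))
```

In this toy setup, a point 10 units in front of the camera at the principal point shifts horizontally by `fx * 1 / 10` pixels. Errors in the predicted focal length or principal point would distort this mapping, which suggests why intrinsics matter when they are not known in advance.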