Abstract: In the realm of computer vision, the perception and reconstruction of the 3D world through vision signals heavily rely on camera intrinsic parameters, which have long been a subject of intense research within the community. In practical applications, without a strong scene geometry prior like the Manhattan World assumption or special artificial calibration patterns, monocular focal length estimation becomes a challenging task. In this paper, we propose a method for monocular focal length estimation using category-level object priors. Based on two well-studied existing tasks: monocular depth estimation and category-level object canonical representation learning, our focal solver takes depth priors and object shape priors from images containing objects and estimates the focal length from triplets of correspondences in closed form. Our experiments on simulated and real world data demonstrate that the proposed method outperforms the current state-of-the-art, offering a promising solution to the long-standing monocular focal length estimation problem.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses focal length estimation from a single (monocular) image. Specifically, the authors propose a method that uses category-level object priors to estimate the focal length. Traditionally, focal length estimation depends on strong scene geometry assumptions or special artificial calibration patterns, but in practical applications these assumptions are often hard to satisfy, which makes monocular focal length estimation very challenging.

The main contributions of the paper include:

1. **Proposing a new focal length estimation method**: Using category-level object priors and monocular depth estimation, a simple and efficient minimal solver is proposed. This is the first time the two have been combined for focal length estimation.
2. **Demonstrating the effectiveness and robustness of the method**: Experiments on simulated and real-world data show that the proposed method outperforms current state-of-the-art monocular focal length estimation methods.

### Importance of focal length estimation

The focal length is an important camera intrinsic parameter. In computer vision, especially in tasks such as 3D reconstruction, Structure from Motion, and visual SLAM, it plays a crucial role, and accurate focal length estimation can significantly improve the performance of these tasks. However, when only one image is available, traditional focal length estimation methods usually rely on strong assumptions, such as known scene geometry or specific objects, which do not hold in many cases.

### Method overview

The method proposed in the paper proceeds as follows:

1. **Input**: An RGB image containing objects of known categories.
2. **Pre-processing**: Use an existing monocular depth predictor and a Normalized Object Coordinates (NOCs) predictor to obtain the depth and the 3D canonical point of each visible 2D image point.
3. **Geometric relationship constraints**: From the geometric relationship between 2D image points and 3D NOCs, establish constraints on the unknown intrinsic parameters and the object pose.
4. **Focal length estimation**: With the proposed fCOP solver, estimate the focal length in closed form from triplets of corresponding points.

### Formula summary

The key formulas involved in the paper are as follows:

- Perspective transformation relationship under the camera model:

\[
d_i K^{-1}\tilde{x}_i + \epsilon_{d_i} = sR(p_i + \epsilon_{p_i}) + t + o_i
\]

  where \(K\) is the unknown camera intrinsic matrix, \(s, R, t\) are the unknown scale, rotation and translation respectively, \(\epsilon_{d_i}\) and \(\epsilon_{p_i}\) represent depth noise and NOCs noise respectively, and \(o_i\) is the zero vector for inliers and an arbitrary vector for outliers.

- Geometric relationship after eliminating translation, obtained by subtracting the noise-free inlier equations of two correspondences \(i\) and \(j\):

\[
K^{-1}(d_i\tilde{x}_i - d_j\tilde{x}_j) = sR(p_i - p_j)
\]

- Norm relationship after eliminating rotation, which holds because a rotation preserves vector length:

\[
\|K^{-1}(d_i\tilde{x}_i - d_j\tilde{x}_j)\| = s\|p_i - p_j\|
\]

- Linear system in \(s^2\) and \(1/f^2\) for a triplet of correspondences \(i, j, k\):

\[
\begin{bmatrix}
\|p_i - p_j\|^2 & -\|d_i x_i - d_j x_j\|^2\\
\|p_j - p_k\|^2 & -\|d_j x_j - d_k x_k\|^2\\
\|p_i - p_k\|^2 & -\|d_i x_i - d_k x_k\|^2
\end{bmatrix}
\begin{bmatrix} s^2\\ 1/f^2 \end{bmatrix}
=
\begin{bmatrix} (d_i - d_j)^2\\ (d_j - d_k)^2\\ (d_i - d_k)^2 \end{bmatrix}
\]
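The triplet constraint can be illustrated with a short sketch. It is not the authors' implementation, and it rests on assumptions that the summary does not state: that \(K = \mathrm{diag}(f, f, 1)\), and that \(x_i\) is the 2D pixel coordinate of \(\tilde{x}_i\) measured from the principal point. Under those assumptions, squaring the norm relationship for a pair \((i, j)\) gives \(s^2\|p_i - p_j\|^2 - \tfrac{1}{f^2}\|d_i x_i - d_j x_j\|^2 = (d_i - d_j)^2\), which is linear in \(s^2\) and \(1/f^2\). The function names `solve_focal_from_triplet` and `project_triplet` are hypothetical, and solving the three equations by least squares is a choice made here for illustration.

```python
import numpy as np


def solve_focal_from_triplet(x, d, p):
    """Estimate (f, s) from three 2D-3D correspondences with depths.

    x: (3, 2) pixel coordinates relative to the principal point (assumed).
    d: (3,) predicted depths.  p: (3, 3) canonical (NOCs) points.
    Returns (f, s), or None if the triplet is degenerate.
    """
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    p = np.asarray(p, dtype=float)
    pairs = [(0, 1), (1, 2), (0, 2)]
    dx = d[:, None] * x  # depth-scaled pixel coordinates d_i * x_i
    # One row per pair: [||p_i - p_j||^2, -||d_i x_i - d_j x_j||^2]
    A = np.array([[np.sum((p[i] - p[j]) ** 2), -np.sum((dx[i] - dx[j]) ** 2)]
                  for i, j in pairs])
    b = np.array([(d[i] - d[j]) ** 2 for i, j in pairs])
    # Rescale columns so the very different magnitudes do not hurt accuracy.
    scale = np.linalg.norm(A, axis=0)
    if np.any(scale == 0):
        return None
    z, *_ = np.linalg.lstsq(A / scale, b, rcond=None)
    s2, inv_f2 = z / scale  # unknowns are s^2 and 1/f^2
    if s2 <= 0 or inv_f2 <= 0:
        return None  # no physically meaningful solution
    return 1.0 / np.sqrt(inv_f2), np.sqrt(s2)


def project_triplet(f, s, theta_deg=30.0):
    """Synthetic, noise-free triplet: fixed NOCs, rotation about the y axis."""
    p = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
    th = np.deg2rad(theta_deg)
    c, sn = np.cos(th), np.sin(th)
    R = np.array([[c, 0.0, sn], [0.0, 1.0, 0.0], [-sn, 0.0, c]])
    t = np.array([0.1, -0.2, 10.0])
    X = s * p @ R.T + t            # points in the camera frame
    d = X[:, 2]                    # depth is the z coordinate
    x = f * X[:, :2] / d[:, None]  # pinhole projection, principal point at 0
    return x, d, p


if __name__ == "__main__":
    x, d, p = project_triplet(f=500.0, s=2.0)
    print(solve_focal_from_triplet(x, d, p))
```

On noise-free data the three equations are consistent, so least squares returns the exact solution. When all three points lie at the same depth the right-hand side vanishes and the two columns become proportional, so the focal length cannot be recovered from that triplet. With noisy or outlier-contaminated predictions, a sketch like this would typically sit inside a robust sampling loop, which is omitted here.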

fCOP: Focal Length Estimation from Category-level Object Priors
