Abstract: Many applications, such as autonomous navigation, urban planning and asset monitoring, rely on the availability of accurate information about objects and their geolocations. In this paper we propose to automatically detect and compute the GPS coordinates of recurring stationary objects of interest using street view imagery. Our processing pipeline relies on two fully convolutional neural networks: the first segments objects in the images while the second estimates their distance from the camera. To geolocate all the detected objects coherently we propose a novel custom Markov Random Field model to perform object triangulation. The novelty of the resulting pipeline is the combined use of monocular depth estimation and triangulation to enable automatic mapping of complex scenes with multiple visually similar objects of interest. We experimentally validate the effectiveness of our approach on two object classes: traffic lights and telegraph poles. The experiments report high object recall rates and GPS accuracy within 2 meters, which is comparable with the precision of single-frequency GPS receivers.

What problem does this paper attempt to address?

This paper addresses the automatic detection and GPS localization of recurring stationary objects in street-view images. Specifically, the authors propose a street-view processing pipeline that discovers and geotags these objects without manual intervention. The problem matters for applications such as autonomous navigation, urban planning, and asset monitoring, all of which rely on accurate object location information.

### Core Contributions of the Paper

1. **Complete Image-Processing Pipeline**: The paper proposes a complete pipeline for geotagging recurring stationary objects from street-view images. Its main components are two fully convolutional neural networks (FCNNs), one for semantic segmentation and the other for monocular depth estimation, and a new geotagging model based on a Markov random field (MRF), which achieves automatic mapping by combining depth information with geometric triangulation.
2. **Handling Partially or Completely Occluded Objects**: The method can handle partially or completely occluded objects without explicit geometric modeling and without relying on object position patterns.
3. **Modular Design**: The pipeline is modular, so the segmentation and depth modules can be replaced with pre-trained solutions for specific object categories.

### Experimental Verification

The paper verifies the method on two object categories: traffic lights and telegraph poles. The experimental results show high accuracy in object discovery and geotagging, with a GPS error within 2 meters, which is comparable to the precision of a single-frequency GPS receiver.

### Specific Technical Details

1. **Object Segmentation**: A state-of-the-art fully convolutional neural network (FCNN) performs semantic segmentation and outputs pixel-level labels for the subsequent depth estimation.
2. **Monocular Depth Estimation**: A fully convolutional depth-estimation pipeline based on ResNet-50 estimates the distance from the camera to the object from a single image.
3. **Geotagging**:
   - **Single-View Localization**: The geographical bearing of the object relative to the camera is extracted from the segmentation map and combined with the depth estimate to calculate the object's GPS position.
   - **Multi-View Localization**: When an object is observed from multiple viewpoints, redundant detections are resolved through triangulation and MRF optimization, yielding a consistent list of objects.

### MRF Model

The MRF model optimizes object localization across multiple views and is defined by the following energy terms:

- **Unary Energy Term**: Enforces consistency between the triangulated distance and the depth estimate.
- **Binary Energy Term**: Penalizes cases where multiple objects occlude each other or are overly dispersed.
- **Ternary Energy Term**: Penalizes rays without orthogonal points to reduce false detections.

### Experimental Results

- **Traffic Lights**: Tests were carried out on a 0.8-kilometer section of Regent Street in London. A total of 51 object instances were detected, of which 47 were correct, with a reported recall of 0.922 and a precision of 0.922.
- **Telegraph Poles**: Segmentation was trained on a custom data set, and the experiments showed high recall and precision.

In conclusion, the paper proposes an effective method for automatically discovering and geotagging stationary objects using street-view images, with broad potential applications.
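
The geometry behind the single-view and multi-view localization steps summarized above can be illustrated with a short Python sketch. This is not the authors' implementation: it assumes a spherical Earth for the single-view offset and a flat local east/north frame (in meters) for the two-view intersection, and the names `offset_position` and `intersect_rays` are hypothetical. In the paper, deciding which ray intersections correspond to real objects is the job of the MRF model; the sketch only shows how one candidate position is computed.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; spherical approximation (assumption)


def offset_position(lat_deg, lon_deg, bearing_deg, distance_m):
    """Single view: move distance_m from the camera along bearing_deg.

    bearing_deg is measured clockwise from geographic north.
    Returns the object's (latitude, longitude) in degrees.
    """
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    theta = math.radians(bearing_deg)
    delta = distance_m / EARTH_RADIUS_M  # angular distance on the sphere

    lat2 = math.asin(
        math.sin(lat1) * math.cos(delta)
        + math.cos(lat1) * math.sin(delta) * math.cos(theta)
    )
    lon2 = lon1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)


def intersect_rays(p1, bearing1_deg, p2, bearing2_deg):
    """Two views: intersect two view rays in a local flat frame.

    p1 and p2 are camera positions as (east, north) in meters.
    Returns the (east, north) intersection, or None when the rays are
    parallel or meet behind either camera.
    """
    b1 = math.radians(bearing1_deg)
    b2 = math.radians(bearing2_deg)
    d1 = (math.sin(b1), math.cos(b1))  # unit direction as (east, north)
    d2 = (math.sin(b2), math.cos(b2))

    denom = d1[0] * d2[1] - d1[1] * d2[0]  # 2D cross product of directions
    if abs(denom) < 1e-9:
        return None  # parallel rays never intersect

    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom  # distance along ray 1
    t2 = (dx * d1[1] - dy * d1[0]) / denom  # distance along ray 2
    if t1 <= 0 or t2 <= 0:
        return None  # intersection lies behind a camera

    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])
```

For example, `intersect_rays((0.0, 0.0), 45.0, (10.0, 0.0), 315.0)` places the object roughly 5 m east and 5 m north of the first camera. The triangulated distance `t1` is the quantity that a unary consistency check would compare against the monocular depth estimate.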

Automatic Discovery and Geotagging of Objects from Street View Imagery

Automated detecting and placing road objects from street-level images

A joint deep learning network of point clouds and multiple views for roadside object classification from lidar point clouds

Resource-Constrained Simultaneous Detection and Labeling of Objects in High-Resolution Satellite Images

Urban Visual Localization of Block-Wise Monocular Images with Google Street Views

Roadside HD Map Object Reconstruction Using Monocular Camera

Rendering-Enhanced Automatic Image-to-Point Cloud Registration for Roadside Scenes

Crowdsourced 3D Mapping: A Combined Multi-View Geometry and Self-Supervised Learning Approach

Learning from Maps: Visual Common Sense for Autonomous Driving

Aerial image geolocalization from recognition and matching of roads and intersections

Automatic Annotation of Geo-Information in Panoramic Street View by Image Retrieval

Automatic Map Update Using Dashcam Videos

Monocular Visual Object 3D Localization in Road Scenes

3D Extended Object Tracking by Fusing Roadside Sparse Radar Point Clouds and Pixel Keypoints

Landmark Localization for Drone Aerial Mapping Using GPS and Sparse Point Cloud for Photogrammetry Pipeline Automation

Geo-locating Road Objects using Inverse Haversine Formula with NVIDIA Driveworks

Vision-based Global Localization of Unmanned Aerial Vehicles with Street View Images

Automated Static Camera Calibration with Intelligent Vehicles

The Earth ain't Flat: Monocular Reconstruction of Vehicles on Steep and Graded Roads from a Moving Camera

Enhanced Monocular Visual Odometry with AR Poses and Integrated INS-GPS for Robust Localization in Urban Environments

Understanding urban landuse from the above and ground perspectives: A deep learning, multimodal solution