Automatic Discovery and Geotagging of Objects from Street View Imagery

Vladimir A. Krylov, Eamonn Kenny, Rozenn Dahyot
DOI: https://doi.org/10.3390/rs10050661
2017-12-02
Abstract: Many applications such as autonomous navigation, urban planning and asset monitoring rely on the availability of accurate information about objects and their geolocations. In this paper we propose to automatically detect and compute the GPS coordinates of recurring stationary objects of interest using street view imagery. Our processing pipeline relies on two fully convolutional neural networks: the first segments objects in the images while the second estimates their distance from the camera. To geolocate all the detected objects coherently we propose a novel custom Markov Random Field model to perform object triangulation. The novelty of the resulting pipeline is the combined use of monocular depth estimation and triangulation to enable automatic mapping of complex scenes with multiple visually similar objects of interest. We validate experimentally the effectiveness of our approach on two object classes: traffic lights and telegraph poles. The experiments report high object recall rates and GPS accuracy within 2 meters, which is comparable with the precision of single-frequency GPS receivers.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of automatically detecting recurring stationary objects in street-view images and computing their GPS coordinates. Specifically, the authors propose a processing pipeline that discovers and geotags such objects from street-view imagery. The problem matters for applications such as autonomous navigation, urban planning, and asset monitoring, all of which rely on accurate object location information.

### Core Contributions of the Paper

1. **Complete Image-Processing Pipeline**: The paper proposes a complete image-processing pipeline for geotagging recurring stationary objects from street-view images. Its main components are two fully convolutional neural networks (FCNNs), one for semantic segmentation and one for monocular depth estimation, and a new geotagging model based on a Markov random field (MRF), which achieves automatic mapping by combining depth information with geometric triangulation.
2. **Handling Partially or Completely Occluded Objects**: The method can handle partially or completely occluded objects without explicit geometric modeling and without relying on object position patterns.
3. **Modular Design**: The pipeline is modular, so the segmentation and depth modules can be replaced with pre-trained solutions for specific object categories.

### Experimental Verification

The paper verifies the effectiveness of the method on two object categories: traffic lights and telegraph poles. The experiments show high recall in object discovery and a GPS error within 2 meters, which is comparable to the precision of a single-frequency GPS receiver.

### Specific Technical Details

1. **Object Segmentation**: A state-of-the-art fully convolutional neural network (FCNN) performs semantic segmentation and outputs pixel-level labels for the subsequent depth estimation.
2. **Monocular Depth Estimation**: A fully convolutional depth-estimation pipeline based on ResNet-50 estimates the distance from the camera to the object from a single image.
3. **Geotagging**:
   - **Single-View Localization**: The geographical bearing of the object relative to the camera is extracted from the segmentation map and combined with the depth estimate to compute the object's GPS position.
   - **Multi-View Localization**: When an object is observed from multiple viewpoints, redundant detections are resolved through triangulation and MRF optimization, which yields a consistent list of objects.

### MRF Model

The MRF model optimizes object localization across multiple views and is defined by the following energy terms:

- **Unary Energy Term**: Enforces consistency between the triangulated distance and the depth estimate.
- **Binary Energy Term**: Penalizes cases where objects occlude one another and cases where intersections are overly dispersed.
- **Ternary Energy Term**: Penalizes rays that have no positive intersection, which reduces false detections.

### Experimental Results

- **Traffic Lights**: Tests were carried out on a 0.8-kilometer section of Regent Street in London. A total of 51 object instances were detected, of which 47 were correct, with a recall of 0.922 and a precision of 0.922.
- **Telegraph Poles**: Segmentation was trained on a custom data set, and the experiments showed high recall and precision.

In conclusion, the paper proposes an effective method for automatically discovering and geotagging stationary objects from street-view images, with broad application prospects.
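
The single-view localization step summarized above (bearing from the segmentation map plus a monocular depth estimate) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `pixel_to_bearing` and `offset_position` are hypothetical, the panorama is assumed to be equirectangular with its center column aligned to the camera heading, and the Earth is approximated as a sphere.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation (assumption)


def pixel_to_bearing(x, width, heading_deg):
    """Map a pixel column of a 360-degree panorama to a compass bearing.

    Assumes (hypothetically) an equirectangular image whose center column
    points along the camera heading.
    """
    return (heading_deg + (x / width - 0.5) * 360.0) % 360.0


def offset_position(lat_deg, lon_deg, bearing_deg, distance_m):
    """Return the (lat, lon) reached by moving distance_m along bearing_deg.

    Uses the standard great-circle destination formula on a sphere.
    """
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    theta = math.radians(bearing_deg)       # bearing, clockwise from north
    delta = distance_m / EARTH_RADIUS_M     # angular distance travelled

    lat2 = math.asin(
        math.sin(lat1) * math.cos(delta)
        + math.cos(lat1) * math.sin(delta) * math.cos(theta)
    )
    lon2 = lon1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)


if __name__ == "__main__":
    # Hypothetical detection: object centered at column 640 of a 1024-pixel-wide
    # panorama, camera heading 30 degrees, estimated depth 12 m.
    bearing = pixel_to_bearing(640, 1024, 30.0)
    print(offset_position(51.5100, -0.1400, bearing, 12.0))
```

At the distances relevant here (tens of meters) the spherical approximation contributes far less error than the depth estimate itself, which is why a single-view position is only a starting point that multi-view triangulation then refines.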
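
The geometric core of multi-view localization is intersecting view rays cast from different camera positions toward the same detected object. The sketch below shows only that ray-intersection step in a local planar frame (x east, y north, in meters); it does not implement the MRF optimization. The function name `ray_intersection` and the planar frame are assumptions made for illustration.

```python
import math


def ray_intersection(p1, bearing1_deg, p2, bearing2_deg, eps=1e-9):
    """Intersect two 2D view rays given camera positions and compass bearings.

    Returns ((x, y), t1, t2), where t1 and t2 are the distances from each
    camera to the intersection, or None if the rays are parallel or the
    intersection lies behind either camera.
    """
    # A compass bearing (clockwise from north) maps to direction (east, north).
    d1 = (math.sin(math.radians(bearing1_deg)), math.cos(math.radians(bearing1_deg)))
    d2 = (math.sin(math.radians(bearing2_deg)), math.cos(math.radians(bearing2_deg)))

    # Solve p1 + t1*d1 = p2 + t2*d2 using 2D cross products.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < eps:
        return None  # parallel rays never meet at a single point

    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom
    t2 = (dx * d1[1] - dy * d1[0]) / denom
    if t1 <= 0 or t2 <= 0:
        return None  # intersection is behind at least one camera

    point = (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])
    return point, t1, t2


if __name__ == "__main__":
    # Two hypothetical cameras 10 m apart, both seeing the same object.
    print(ray_intersection((0.0, 0.0), 45.0, (10.0, 0.0), 315.0))
```

The returned distances `t1` and `t2` are the triangulated ranges that a consistency check could compare against the monocular depth estimates, in the spirit of the unary energy term described above.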