Urban Visual Localization of Block-Wise Monocular Images with Google Street Views

Zhixin Li, Shuang Li, John Anderson, Jie Shan
DOI: https://doi.org/10.3390/rs16050801
IF: 5
2024-02-26
Remote Sensing
Abstract: Urban visual localization is the process of determining the pose (position and attitude) of the imaging sensor (or platform) with the help of existing geo-referenced data. This task is critical and challenging for many applications, such as autonomous navigation, virtual and augmented reality, and robotics, due to the dynamic and complex nature of urban environments that may obstruct Global Navigation Satellite Systems (GNSS) signals. This paper proposes a block-wise matching strategy for urban visual localization by using geo-referenced Google Street View (GSV) panoramas as the database. To determine the pose of the monocular query images collected from a moving vehicle, neighboring GSVs should be found to establish the correspondence through image-wise and block-wise matching. First, each query image is semantically segmented and a template containing all permanent objects is generated. The template is then utilized in conjunction with a template matching approach to identify the corresponding patch from each GSV image within the database. Through the conversion of the query template and corresponding GSV patch into feature vectors, their image-wise similarity is computed pairwise. To ensure reliable matching, the query images are temporally grouped into query blocks, while the GSV images are spatially organized into GSV blocks. By using the previously computed image-wise similarities, we calculate a block-wise similarity for each query block with respect to every GSV block. A query block and its corresponding GSV blocks of top-ranked similarities are then input into a photogrammetric triangulation or structure from motion process to determine the pose of every image in the query block. A total of three datasets, consisting of two public ones and one newly collected on the Purdue campus, are utilized to demonstrate the performance of the proposed method. The results show that the method achieves meter-level positioning accuracy and is robust to changes in acquisition conditions, such as image resolution, scene complexity, and the time of day.
environmental sciences; imaging science & photographic technology; remote sensing; geosciences, multidisciplinary
What problem does this paper attempt to address?
The paper addresses visual localization from monocular images in urban environments. Specifically, it determines the pose (position and attitude) of the camera for each image in a sequence captured from a moving vehicle, particularly where Global Navigation Satellite System (GNSS) signals are weak or unavailable. To this end, it proposes a block-wise matching strategy that uses geo-referenced Google Street View (GSV) panoramas as the reference database. First, each query image is semantically segmented to produce a template containing its permanent objects, and Quality-Aware Template Matching (QATM) locates the corresponding patch in each GSV image in the database. Next, a pre-trained Contrastive Language-Image Pretraining (CLIP) model converts each template and its corresponding GSV patch into feature vectors, whose cosine similarity serves as the image-wise similarity. The query images are then grouped into query blocks by acquisition time, and the GSV images are grouped into GSV blocks by geographic location. From the image-wise similarities, a block-wise similarity is computed between each query block and every GSV block. Each query block, together with its top-ranked GSV blocks, is then input into a photogrammetric triangulation or Structure from Motion (SfM) process to determine the pose of every image in the query block.
The method is validated on three datasets, two public and one newly collected on the Purdue University campus. The results show meter-level localization accuracy and robustness to changes in acquisition conditions such as image resolution, scene complexity, and time of day.
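The matching steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the short lists stand in for CLIP feature vectors, all function names are hypothetical, the fixed-size temporal grouping is an assumption, and the block score (the mean, over query images, of each image's best similarity within a GSV block) is one plausible aggregation chosen for illustration, since the summary does not state the exact rule.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(query_feats, gsv_feats):
    """Image-wise similarity: sim[i][j] compares query i with GSV image j."""
    return [[cosine_similarity(q, g) for g in gsv_feats] for q in query_feats]


def group_by_time(timestamps, block_size):
    """Group image indices into blocks of consecutive acquisition times.

    Assumption: fixed-size blocks; the paper may group differently.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    return [order[i:i + block_size] for i in range(0, len(order), block_size)]


def block_similarity(sim, query_block, gsv_block):
    """Assumed aggregation: mean of each query image's best match in the block."""
    best = [max(sim[q][g] for g in gsv_block) for q in query_block]
    return sum(best) / len(best)


def rank_gsv_blocks(sim, query_block, gsv_blocks, top_k=1):
    """Return (gsv_block_index, score) pairs for the top_k most similar blocks."""
    scores = [(block_similarity(sim, query_block, b), i)
              for i, b in enumerate(gsv_blocks)]
    scores.sort(reverse=True)
    return [(i, s) for s, i in scores[:top_k]]


# Toy data: two query images and four GSV images with 2-D stand-in features.
query_feats = [[1.0, 0.0], [0.9, 0.1]]
gsv_feats = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.1], [0.8, 0.0]]
gsv_blocks = [[0, 1], [2, 3]]  # assumed to come from spatial grouping

sim = similarity_matrix(query_feats, gsv_feats)
query_blocks = group_by_time([0.0, 0.5], block_size=2)
ranked = rank_gsv_blocks(sim, query_blocks[0], gsv_blocks, top_k=1)
print(ranked)
```

In this toy case the second GSV block ranks first because its features point in nearly the same direction as the query features. In a full pipeline, the top-ranked blocks would then be passed to the triangulation or SfM stage, which is outside the scope of this sketch.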