Abstract: Equitable urban transportation applications require high-fidelity digital representations of the built environment: not just streets and sidewalks, but bike lanes, marked and unmarked crossings, curb ramps and cuts, obstructions, traffic signals, signage, street markings, potholes, and more. Direct inspections and manual annotations are prohibitively expensive at scale. Conventional machine learning methods require substantial annotated training data for adequate performance. In this paper, we consider vision language models as a mechanism for annotating diverse urban features from satellite images, reducing the dependence on human annotation to produce large training sets. While these models have achieved impressive results in describing common objects in images captured from a human perspective, their training sets are less likely to include strong signals for esoteric features in the built environment, and their performance in these settings is therefore unclear. We demonstrate proof-of-concept combining a state-of-the-art vision language model and variants of a prompting strategy that asks the model to consider segmented elements independently of the original image. Experiments on two urban features -- stop lines and raised tables -- show that while direct zero-shot prompting correctly annotates nearly zero images, the pre-segmentation strategies can annotate images with near 40% intersection-over-union accuracy. We describe how these results inform a new research agenda in automatic annotation of the built environment to improve equity, accessibility, and safety at broad scale and in diverse environments.

What problem does this paper attempt to address?

The problem this paper addresses is how to use vision-language models (VLMs) to automatically annotate diverse features of the urban environment, reducing the dependence on expensive and time-consuming manual annotation. Specifically, the paper aims to identify and label urban infrastructure features such as stop lines and raised tables in satellite images by combining a state-of-the-art vision-language model with image segmentation, thereby improving the efficiency and scalability of urban data annotation.

### Problem Background

1. **Requirement for High-Fidelity Digital Representation**:
   - Urban transportation applications require high-fidelity digital representations that cover not only streets and sidewalks, but also bike lanes, marked and unmarked crossings, curb ramps and cuts, traffic signals, signage, street markings, potholes, and more.
   - Direct inspection and manual annotation of these features are prohibitively expensive at scale.
2. **Limitations of Existing Methods**:
   - Conventional machine learning methods require substantial annotated training data to perform adequately.
   - Although existing vision-language models perform well at describing common objects, their performance on less common features of the built environment is unclear.

### Core Problems of the Paper

The paper proposes a zero-shot annotation method based on vision-language models, with the following aims:
- **Reduce Dependence on Manual Annotation**: Use pre-trained vision-language models to reduce the need for large amounts of manually annotated data.
- **Improve Annotation Efficiency and Accuracy**: Combine image segmentation with visual prompting strategies to improve the accuracy and efficiency of automatic annotation.
- **Expand to Multiple Urban Features**: Explore how to apply the method to different types of urban infrastructure features, such as stop lines and raised tables.

### Method Overview

The paper proposes a zero-shot annotation pipeline that combines vision-language models with image segmentation. The steps are as follows:
1. **User Input**: Provide a pair (satellite image, annotation guidance).
2. **Image Segmentation**: Use a general-purpose segmentation model to segment the image into multiple candidate objects.
3. **Candidate Filtering**: Apply heuristic filters (such as color and area) to eliminate irrelevant components and narrow the candidate space.
4. **Generate Set-of-Mark (SoM)**: Generate an identifier (mark) for each remaining candidate object to improve the vision-language model's ability to recognize and refer to it.
5. **VLM Processing and Post-processing**: Pass the SoM image and the text guidance to the vision-language model, then post-process its output into the final annotation.

### Experimental Results

The paper evaluates the method experimentally. The main findings are as follows:
- **Direct Prompting Fails**: Direct zero-shot prompting is almost completely ineffective for the annotation task, producing almost no overlap with the ground-truth annotations.
- **Visual Prompting Significantly Improves Performance**: With the pre-segmentation prompting variants (No-Context, In-Context, and Combination), annotation performance improves substantially, and the average Intersection over Union (IoU) reaches about 40%.
- **Influence of Different Prompting Strategies**: The In-Context prompt is more effective than the No-Context prompt, and providing both forms of prompt can improve performance further.
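
The candidate filtering step and the IoU metric mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`filter_by_area`, `merge_masks`, `iou`), the area thresholds, and the toy masks are all assumptions made for illustration, and the VLM selection step is skipped by treating every surviving candidate as selected.

```python
import numpy as np

def filter_by_area(masks, min_frac=0.001, max_frac=0.05):
    """Keep candidate masks whose area is a plausible fraction of the image.

    The thresholds are hypothetical; the summary only says that heuristic
    filters such as color and area are applied.
    """
    kept = []
    for mask in masks:
        frac = mask.sum() / mask.size
        if min_frac <= frac <= max_frac:
            kept.append(mask)
    return kept

def merge_masks(masks, shape):
    """Union a list of boolean masks into one predicted annotation."""
    merged = np.zeros(shape, dtype=bool)
    for mask in masks:
        merged |= mask
    return merged

def iou(pred, truth):
    """Intersection over Union of two boolean masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    # Assumption: two empty masks count as a perfect match.
    return float(inter / union) if union else 1.0

shape = (100, 100)

# Toy ground truth: a thin horizontal "stop line" of 4 x 60 pixels.
truth = np.zeros(shape, dtype=bool)
truth[50:54, 20:80] = True

# Toy segmentation candidates.
partial_line = np.zeros(shape, dtype=bool)
partial_line[50:54, 20:60] = True      # covers 160 of the 240 true pixels
road_surface = np.zeros(shape, dtype=bool)
road_surface[:, 0:60] = True           # far too large, filtered out
speck = np.zeros(shape, dtype=bool)
speck[5, 5] = True                     # far too small, filtered out

kept = filter_by_area([partial_line, road_surface, speck])
pred = merge_masks(kept, shape)
score = iou(pred, truth)
print(len(kept), round(score, 3))
```

In this toy case only the partial stop-line candidate survives the area filter, and the prediction overlaps 160 of the 240 ground-truth pixels, so the score is 160/240. An average IoU of about 40% over a dataset means that, on average, predicted and ground-truth regions share roughly two fifths of their combined area.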
### Conclusions and Prospects

The paper demonstrates the feasibility of using vision-language models for zero-shot annotation and identifies several open challenges, including segmentation quality, the models' limited understanding of uncommon urban features, and inconsistent annotation results. Future research directions include developing segmentation models designed for satellite images, improving the models' understanding of such features, and improving the consistency and reliability of annotation results.

Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models (Vision Paper)

From Time to Space: Automatic Annotation of Unmarked Traffic Scene Based on Trajectory Data

Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation

Zero-shot Building Attribute Extraction from Large-Scale Vision and Language Models

Revolutionizing Urban Safety Perception Assessments: Integrating Multimodal Large Language Models with Street View Images

Zero-shot urban function inference with street view images through prompting a pretrained vision-language model

Mixed land use measurement and mapping with street view images and spatial context-aware prompts via zero-shot multimodal learning

Zero-shot detection of buildings in mobile LiDAR using Language Vision Model

Deep semantic-aware network for zero-shot visual urban perception

Migratable urban street scene sensing method based on vision language pre-trained model

Research on Human-Machine Collaborative Annotation for Traffic Scene Data

ELEV-VISION-SAM: Integrated Vision Language and Foundation Model for Automated Estimation of Building Lowest Floor Elevation

Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations

Exploration of an Open Vocabulary Model on Semantic Segmentation for Street Scene Imagery

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

Zero-Shot Video Grounding for Automatic Video Understanding in Sustainable Smart Cities

Prompt-guided and multimodal landscape scenicness assessments with vision-language models

VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation

Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification

MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

Vision-Language Models for Zero-Shot Classification of Remote Sensing Images