Abstract: Existing Open Vocabulary Detection (OVD) models exhibit a number of challenges. They often struggle with semantic consistency across diverse inputs, and are often sensitive to slight variations in input phrasing, leading to inconsistent performance. The calibration of their predictive confidence, especially in complex multi-label scenarios, remains suboptimal, frequently resulting in overconfident predictions that do not accurately reflect their context understanding. To understand these limitations, multi-label detection benchmarks are needed. A particularly challenging domain for such benchmarking is social activities. Due to the lack of multi-label benchmarks for social interactions, in this work we present ELSA: Evaluating Localization of Social Activities. ELSA draws on theoretical frameworks in urban sociology and design and uses in-the-wild street-level imagery, where the size of groups and the types of activities vary significantly. ELSA includes more than 900 manually annotated images with more than 4,300 multi-labeled bounding boxes for individual and group activities. We introduce a novel confidence score computation method NLSE and a novel Dynamic Box Aggregation (DBA) algorithm to assess semantic consistency in overlapping predictions. We report our results on the widely-used SOTA models Grounding DINO, Detic, OWL, and MDETR. Our evaluation protocol considers semantic stability and localization accuracy and further exposes the limitations of existing approaches.

What problem does this paper attempt to address?

This paper aims to solve multiple challenges faced by open-vocabulary detection (OVD) models when identifying and localizing social activities on urban streets. Specifically, the paper attempts to solve the following key problems:

1. **Semantic Consistency Problem**:
   - Existing OVD models often struggle to maintain semantic consistency when dealing with diverse inputs and are very sensitive to minor changes in input wording, resulting in unstable performance.
2. **Prediction Confidence Calibration Problem**:
   - OVD models have poor prediction confidence calibration in complex multi-label scenarios, often making over-confident predictions that fail to accurately reflect their understanding of the context.
3. **Lack of Appropriate Benchmark Datasets**:
   - The lack of multi-label detection benchmark datasets for social interactions hinders the evaluation and improvement of model performance in the real world, especially in dynamic environments such as urban streets.
4. **Limitations of Multi-label Detection**:
   - Existing methods have limitations in handling multi-label detection. For example, the standard average precision (AP) is easily affected by cross-category ranking changes, and the traditional non-maximum suppression (NMS) method cannot effectively handle overlapping or inconsistent predictions.

To solve these problems, the authors propose ELSA (Evaluating Localization of Social Activities), a new benchmark dataset and evaluation framework for evaluating the ability of OVD models to identify and localize human activities from static images. The main contributions of ELSA include:

1. **Providing a Comprehensive Multi-label Annotation Dataset**:
   - It contains 934 street-view images and more than 4,300 human-activity annotation boxes with 115 unique label combinations, covering conditions, states, activities, and other information.
2. **Introducing a Novel Confidence Scoring Method N-LSE**:
   - Using the Normalized Log-Sum-Exp (N-LSE) function to calculate more representative confidence scores, reducing bias towards common categories and increasing attention to subtle attributes.
3. **Proposing a Dynamic Box Aggregation Algorithm DBA**:
   - The DBA algorithm processes overlapping predictions by considering confidence scores and semantic consistency, avoiding problems of the traditional NMS method, such as falsely suppressing true-positive predictions and hiding the model's weaknesses in understanding targets.

These innovations address key challenges in OVD model evaluation and provide valuable tools for future research.
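The summary names N-LSE and the weakness of NMS but gives neither the N-LSE formula nor the DBA procedure. The sketch below is therefore only illustrative. It assumes that "normalized" means subtracting log(n) from the log-sum-exp (a log-mean-exp), which may differ from the paper's definition, and it pairs that with a textbook greedy NMS to show how a lower-scored but semantically richer prediction on the same person can be suppressed. The function names `normalized_lse`, `iou`, and `greedy_nms` and the example boxes and scores are hypothetical and do not come from ELSA. DBA itself is not implemented here.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def normalized_lse(logits: Sequence[float]) -> float:
    """Log-sum-exp of token logits minus log(n).

    ASSUMPTION: this log-mean-exp normalization is one plausible reading
    of "Normalized Log-Sum-Exp"; the paper's exact formula may differ.
    The result always lies between min(logits) and max(logits).
    """
    if not logits:
        raise ValueError("logits must be non-empty")
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - math.log(len(logits))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Standard greedy NMS: keep the highest-scoring box, drop any box
    that overlaps an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical predictions for the same pedestrian from two prompts:
# index 0 = "person walking", index 1 = "person walking and talking on phone".
boxes = [(10.0, 10.0, 50.0, 100.0), (12.0, 11.0, 51.0, 99.0)]
scores = [0.9, 0.6]
kept = greedy_nms(boxes, scores)  # keeps only index 0; the richer label is lost
```

In this toy case the two boxes overlap almost entirely, so NMS discards the second prediction even though its label may be the more accurate multi-label description. This is the kind of false suppression the summary attributes to traditional NMS and that DBA is meant to avoid by also considering semantic consistency.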

ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection
