ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection

Maryam Hosseini, Marco Cipriano, Sedigheh Eslami, Daniel Hodczak, Liu Liu, Andres Sevtsuk, Gerard de Melo
2024-12-05
Abstract: Existing Open Vocabulary Detection (OVD) models exhibit a number of challenges. They often struggle with semantic consistency across diverse inputs, and are often sensitive to slight variations in input phrasing, leading to inconsistent performance. The calibration of their predictive confidence, especially in complex multi-label scenarios, remains suboptimal, frequently resulting in overconfident predictions that do not accurately reflect their context understanding. To understand these limitations, multi-label detection benchmarks are needed. A particularly challenging domain for such benchmarking is social activities. Due to the lack of multi-label benchmarks for social interactions, in this work we present ELSA: Evaluating Localization of Social Activities. ELSA draws on theoretical frameworks in urban sociology and design and uses in-the-wild street-level imagery, where the size of groups and the types of activities vary significantly. ELSA includes more than 900 manually annotated images with more than 4,300 multi-labeled bounding boxes for individual and group activities. We introduce a novel confidence score computation method NLSE and a novel Dynamic Box Aggregation (DBA) algorithm to assess semantic consistency in overlapping predictions. We report our results on the widely-used SOTA models Grounding DINO, Detic, OWL, and MDETR. Our evaluation protocol considers semantic stability and localization accuracy and further exposes the limitations of existing approaches.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several challenges that open-vocabulary detection (OVD) models face when identifying and localizing social activities on urban streets. Specifically, it targets the following problems:

1. **Semantic Consistency Problem**:
   - Existing OVD models often struggle to maintain semantic consistency across diverse inputs and are sensitive to minor changes in input wording, resulting in unstable performance.
2. **Prediction Confidence Calibration Problem**:
   - OVD models are poorly calibrated in complex multi-label scenarios, often making overconfident predictions that do not accurately reflect their understanding of the context.
3. **Lack of Appropriate Benchmark Datasets**:
   - The lack of multi-label detection benchmarks for social interactions hinders the evaluation and improvement of model performance in the real world, especially in dynamic environments such as urban streets.
4. **Limitations of Multi-label Detection**:
   - Existing methods have limitations in handling multi-label detection. For example, standard average precision (AP) is easily affected by cross-category ranking changes, and traditional non-maximum suppression (NMS) cannot effectively handle overlapping or inconsistent predictions.

To address these problems, the authors propose ELSA (Evaluating Localization of Social Activities), a new benchmark dataset and evaluation framework for assessing the ability of OVD models to identify and localize human activities in static images. The main contributions of ELSA include:

1. **Providing a Comprehensive Multi-label Annotation Dataset**:
   - It contains 934 street-view images and more than 4,300 human-activity bounding boxes with 115 unique label combinations, covering conditions, states, activities, and other information.
2. **Introducing a Novel Confidence Scoring Method N-LSE**:
   - The Normalized Log-Sum-Exp (N-LSE) function is used to compute more representative confidence scores, reducing bias towards common categories and increasing attention to subtle attributes.
3. **Proposing a Dynamic Box Aggregation Algorithm DBA**:
   - The DBA algorithm processes overlapping predictions by considering both confidence scores and semantic consistency. This avoids problems with traditional NMS, such as falsely suppressing true-positive predictions and hiding the model's weaknesses in understanding targets.

These innovations address key challenges in OVD model evaluation and provide valuable tools for future research.
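
The summary names N-LSE and DBA but does not give their definitions, so the following is only a rough sketch of the two ideas under stated assumptions, not the authors' implementation. It assumes that "normalized" means subtracting log(n) from the log-sum-exp (that is, taking the log of the mean of the exponentials), and that overlapping predictions are grouped greedily by IoU while every member is kept, rather than being suppressed as in NMS. The function names `normalized_lse` and `group_overlapping`, and the threshold value 0.8, are hypothetical and chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def normalized_lse(logits: Sequence[float]) -> float:
    """Log-sum-exp minus log(n): the log of the mean of exp(logits).

    ASSUMPTION: this normalization is one plausible reading of "N-LSE";
    the paper's exact formula may differ.
    """
    if not logits:
        raise ValueError("logits must be non-empty")
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - math.log(len(logits))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_overlapping(
    boxes: List[Box], scores: List[float], threshold: float = 0.8
) -> List[List[int]]:
    """Greedily group box indices whose IoU with a seed box >= threshold.

    Unlike NMS, lower-scoring members are kept in the group, so their
    labels can later be compared for semantic consistency.
    The threshold is an illustrative value, not taken from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    assigned = set()
    groups: List[List[int]] = []
    for i in order:
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in order:
            if j not in assigned and iou(boxes[i], boxes[j]) >= threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups


if __name__ == "__main__":
    print(normalized_lse([1.0, 3.0]))
    demo_boxes = [(0, 0, 10, 10), (0, 0, 10, 9.5), (50, 50, 60, 60)]
    print(group_overlapping(demo_boxes, [0.9, 0.8, 0.7]))
```

Under this assumed normalization the score always lies between the smallest and largest input logit, so a prompt with many tokens is not rewarded merely for its length. Keeping whole groups instead of a single surviving box is what would allow a later step to check whether overlapping predictions agree on their labels.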