Abstract: This paper addresses the problem of on-road object importance estimation, which utilizes video sequences captured from the driver's perspective as the input. Although this problem is significant for safer and smarter driving systems, the exploration of this problem remains limited. On one hand, publicly available large-scale datasets are scarce in the community. To address this dilemma, this paper contributes a new large-scale dataset named Traffic Object Importance (TOI). On the other hand, existing methods often consider only either the bottom-up feature or single-fold guidance, leading to limitations in handling highly dynamic and diverse traffic scenarios. Different from existing methods, this paper proposes a model that integrates multi-fold top-down guidance with the bottom-up feature. Specifically, three kinds of top-down guidance factors (i.e., driver intention, semantic context, and traffic rule) are integrated into our model. These factors are important for object importance estimation, but none of the existing methods simultaneously consider them. To our knowledge, this paper proposes the first on-road object importance estimation model that fuses multi-fold top-down guidance factors with the bottom-up feature. Extensive experiments demonstrate that our model outperforms state-of-the-art methods by large margins, achieving 23.1% Average Precision (AP) improvement compared with the recently proposed model (i.e., Goal).

What problem does this paper attempt to address?

The problem this paper addresses is **on-road object importance estimation**. Specifically, it uses video sequences captured from the driver's perspective to estimate how important each object on the road is, with the aim of making driving systems safer and smarter.

### Problem Background and Challenges

1. **Scarcity of Datasets**:
   - Publicly available large-scale datasets for on-road object importance estimation are scarce. The existing public dataset Ohn-Bar [33] is small, containing only 3,187 frames, 8 scenes, and 16,076 object labels, which is too little to support the training of complex models.
2. **Limitations of Existing Methods**:
   - Most existing methods consider only bottom-up features or single-fold top-down guidance. They struggle with highly dynamic and diverse traffic scenes because they do not jointly account for driver intention, semantic context, and traffic rules.

### Main Contributions of the Paper

1. **New Dataset TOI**:
   - The paper releases a new large-scale dataset, Traffic Object Importance (TOI), which contains 9,858 frames, 28 scenes, and 44,120 object labels. Compared with Ohn-Bar [33], TOI has about 3.1, 3.5, and 2.7 times as many frames, scenes, and object labels, respectively.
2. **Multi-Fold Top-Down Guidance Model**:
   - The paper proposes a model that fuses multi-fold top-down guidance factors (driver intention, semantic context, traffic rule) with bottom-up features. According to the authors, this is the first on-road object importance estimation model to combine the two.
3. **Incorporating the Influence of Traffic Rules**:
   - For the first time, the paper incorporates traffic rules into on-road object importance estimation, proposing an adaptive object-lane interaction mechanism to model this abstract concept.

### Model Structure

The model consists of four key modules:

1. **Object Feature Extraction (OFE) Module**:
   - Extracts the spatial feature \( f_{o,s} \) and temporal feature \( f_{o,t} \) of each object.
2. **Driver Intention and Semantics Guidance (DISG) Module**:
   - Combines the driver's intention and the semantic context to generate the object-intention-semantic interaction feature \( f_{o-i-s} \).
3. **Traffic Rule Guidance (TRG) Module**:
   - Models traffic rules to generate the object-lane interaction feature \( f_{o-l} \).
4. **Object Importance Estimation Module**:
   - Uses \( f_{o-i-s} \) and \( f_{o-l} \) to estimate the importance \( A \) of each object.

### Experimental Results

Experiments on the public dataset [33] and on the TOI dataset show that the model clearly outperforms existing methods: AP improves by 23.1% over the recently proposed Goal model, and the F1 score also improves.

### Summary

By constructing a large-scale dataset and proposing a multi-fold top-down guidance model, this paper addresses both the data scarcity and the limitations of existing methods in on-road object importance estimation, supporting safer and smarter driving systems.
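
The four-module data flow described under Model Structure can be illustrated with a minimal Python sketch. It only mirrors which feature feeds which module; the summary does not say how each module computes its output, so every operation below (element-wise scaling, concatenation, a sigmoid over a weighted sum) is a placeholder assumption, and all function names, parameter names, and toy values are hypothetical rather than taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ObjectFeatures:
    spatial: Vector   # stands in for f_{o,s}
    temporal: Vector  # stands in for f_{o,t}


def ofe(appearance: Vector, motion: Vector) -> ObjectFeatures:
    """OFE placeholder: treats the inputs as already-extracted features."""
    return ObjectFeatures(spatial=list(appearance), temporal=list(motion))


def _object_vector(obj: ObjectFeatures) -> Vector:
    # Assumption: spatial and temporal features are simply concatenated.
    return obj.spatial + obj.temporal


def disg(obj: ObjectFeatures, intention: Vector, semantics: Vector) -> Vector:
    """DISG placeholder producing a stand-in for f_{o-i-s}.

    Assumed interaction: scale each object feature by the sum of the
    intention and semantic guidance values at the same index.
    """
    guide = [i + s for i, s in zip(intention, semantics)]
    return [x * g for x, g in zip(_object_vector(obj), guide)]


def trg(obj: ObjectFeatures, lane: Vector) -> Vector:
    """TRG placeholder producing a stand-in for f_{o-l}.

    Assumed interaction: scale each object feature by a lane feature.
    """
    return [x * l for x, l in zip(_object_vector(obj), lane)]


def estimate_importance(f_ois: Vector, f_ol: Vector,
                        weights: Vector, bias: float = 0.0) -> float:
    """Importance head placeholder: sigmoid of a weighted sum of both features."""
    fused = f_ois + f_ol  # list concatenation of the two guidance features
    if len(weights) != len(fused):
        raise ValueError("weights must match the fused feature length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy usage with made-up numbers for a single object.
obj = ofe(appearance=[0.2, 0.8], motion=[0.5, 0.1])
f_ois = disg(obj, intention=[1.0, 0.0, 1.0, 0.0], semantics=[0.5, 0.5, 0.5, 0.5])
f_ol = trg(obj, lane=[1.0, 1.0, 0.0, 0.0])
score = estimate_importance(f_ois, f_ol, weights=[0.5] * 8)
print(f"importance A = {score:.3f}")
```

The point of the sketch is the wiring: both guidance branches consume the same object features, and only their outputs reach the importance head. In a real model each placeholder would be a learned network.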

On-Road Object Importance Estimation: A New Dataset and A Model with Multi-Fold Top-Down Guidance

Goal-oriented Object Importance Estimation in On-road Driving Videos

A Fusion Method Aiming at Environmental Perception of Autonomous Vehicle Based on Visual Scheme

Research on Autonomous Driving Image Recognition Based on a New Real-Time Object Detection Model YOLOv5st

Real-time Joint Traffic State and Model Parameter Estimation on Freeways with Fixed Sensors and Connected Vehicles: State-of-the-art Overview, Methods, and Case Studies

Object Importance Estimation using Counterfactual Reasoning for Intelligent Driving

A Joint Object Detection and Semantic Segmentation Model with Cross-Attention and Inner-Attention Mechanisms

IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic

A novel real-time object detection method for complex road scenes based on YOLOv7-tiny

High-resolution large-scale urban traffic speed estimation with multi-source crowd sensing data

Epurate-Net: Efficient Progressive Uncertainty Refinement Analysis for Traffic Environment Urban Road Detection

A Comprehensive Implementation of Road Surface Classification for Vehicle Driving Assistance: Dataset, Models, and Deployment

Research on Road Scene Understanding of Autonomous Vehicles Based on Multi-Task Learning

DRUformer: Enhancing Driving Scene Important Object Detection With Driving Scene Relationship Understanding

Research on multitask model of object detection and road segmentation in unstructured road scenes

OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping

Multi-scale feature fusion with attention mechanism for crowded road object detection

Toward Driving Scene Understanding: A Paradigm and Benchmark Dataset for Ego-Centric Traffic Scene Graph Representation

Visual-based On-Road Vehicle Detection: A Transnational Experiment and Comparison

Multi-Intersection Traffic Optimisation: A Benchmark Dataset and a Strong Baseline