Abstract: This paper addresses the problem of on-road object importance estimation, which utilizes video sequences captured from the driver's perspective as the input. Although this problem is significant for safer and smarter driving systems, the exploration of this problem remains limited. On one hand, publicly available large-scale datasets are scarce in the community. To address this dilemma, this paper contributes a new large-scale dataset named Traffic Object Importance (TOI). On the other hand, existing methods often consider only the bottom-up feature or single-fold guidance, leading to limitations in handling highly dynamic and diverse traffic scenarios. Different from existing methods, this paper proposes a model that integrates multi-fold top-down guidance with the bottom-up feature. Specifically, three kinds of top-down guidance factors (i.e., driver intention, semantic context, and traffic rule) are integrated into our model. These factors are important for object importance estimation, but none of the existing methods simultaneously consider them. To our knowledge, this paper proposes the first on-road object importance estimation model that fuses multi-fold top-down guidance factors with the bottom-up feature. Extensive experiments demonstrate that our model outperforms state-of-the-art methods by large margins, achieving a 23.1% Average Precision (AP) improvement compared with the recently proposed model (i.e., Goal).
What problem does this paper attempt to address?
The problem this paper addresses is **on-road object importance estimation**. Specifically, it uses video sequences captured from the driver's perspective to estimate how important each object on the road is, with the goal of making driving systems safer and smarter.
### Problem Background and Challenges
1. **Scarcity of Datasets**:
- Publicly available large-scale datasets for on-road object importance estimation are scarce. The existing public dataset, Ohn-Bar [33], is small: it contains only 3,187 frames, 8 scenes, and 16,076 object labels, which is too little to support training complex models.
2. **Limitations of Existing Methods**:
- Most existing methods consider only bottom-up features or single-fold top-down guidance. They struggle with highly dynamic and diverse traffic scenes because they do not jointly account for driver intention, semantic context, and traffic rules.
### Main Contributions of the Paper
1. **New Dataset TOI**:
- The paper releases a new large-scale dataset, Traffic Object Importance (TOI), which contains 9,858 frames, 28 scenes, and 44,120 object labels. Compared with Ohn-Bar [33], TOI is about 3.1, 3.5, and 2.7 times as large in the number of frames, scenes, and object labels, respectively.
2. **Multi-fold Top-down Guidance Model**:
- The paper proposes a model that fuses multi-fold top-down guidance factors (driver intention, semantic context, and traffic rule) with bottom-up features. According to the authors, it is the first on-road object importance estimation model to combine multi-fold top-down guidance with bottom-up features.
3. **Incorporating the Influence of Traffic Rules**:
- For the first time, the paper incorporates traffic rules into on-road object importance estimation. It models this abstract concept through an adaptive object-lane interaction mechanism.
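
The scale comparison between TOI and Ohn-Bar [33] in the first contribution can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this summary; the dictionary and function names are illustrative and do not come from the paper.

```python
# Dataset statistics as quoted in this summary.
OHN_BAR = {"frames": 3187, "scenes": 8, "objects": 16076}
TOI = {"frames": 9858, "scenes": 28, "objects": 44120}


def scale_ratios(new, old):
    """Return new/old for each statistic, rounded to one decimal place."""
    return {key: round(new[key] / old[key], 1) for key in old}


ratios = scale_ratios(TOI, OHN_BAR)
print(ratios)  # {'frames': 3.1, 'scenes': 3.5, 'objects': 2.7}
```

The rounded ratios match the 3.1, 3.5, and 2.7 figures reported for frames, scenes, and object labels.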
### Model Structure
The model consists of four key modules:
1. **Object Feature Extraction (OFE) Module**:
- Extract the spatial feature \( f_{o,s} \) and temporal feature \( f_{o,t} \) of the object.
2. **Driver Intention and Semantics Guidance (DISG) Module**:
- Combine the driver's intention and semantic context to generate the object-intention-semantic interaction feature \( f_{o-i-s} \).
3. **Traffic Rule Guidance (TRG) Module**:
- Model traffic rules to generate the object-lane interaction feature \( f_{o-l} \).
4. **Object Importance Estimation Module**:
- Use \( f_{o-i-s} \) and \( f_{o-l} \) to estimate the importance \( A \) of the object.
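
The data flow between the four modules can be sketched as below. This is only an illustration of how the features are passed along, not the paper's implementation: this summary does not describe the internals of any module, so every operation here (element-wise sums and products, a mean followed by a sigmoid) is a placeholder assumption, and the function and variable names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def ofe(f_os: Vector, f_ot: Vector) -> Vector:
    """OFE stand-in: merge spatial and temporal object features (placeholder: sum)."""
    return [s + t for s, t in zip(f_os, f_ot)]


def disg(f_o: Vector, intention: Vector, semantics: Vector) -> Vector:
    """DISG stand-in: object-intention-semantic feature (placeholder: product)."""
    return [o * i * s for o, i, s in zip(f_o, intention, semantics)]


def trg(f_o: Vector, lane: Vector) -> Vector:
    """TRG stand-in: object-lane interaction feature (placeholder: product)."""
    return [o * l for o, l in zip(f_o, lane)]


def estimate_importance(f_ois: Vector, f_ol: Vector) -> float:
    """Estimation stand-in: squash the mean of both features into (0, 1)."""
    fused = f_ois + f_ol  # list concatenation of the two guidance features
    mean = sum(fused) / len(fused)
    return 1.0 / (1.0 + math.exp(-mean))


# Toy inputs; all vectors share one dimension in this sketch.
f_o = ofe([1.0, 0.0], [0.0, 1.0])
f_ois = disg(f_o, intention=[0.5, 0.5], semantics=[1.0, 2.0])
f_ol = trg(f_o, lane=[1.0, -1.0])
A = estimate_importance(f_ois, f_ol)
print(f"importance A = {A:.3f}")
```

The point of the sketch is the wiring: the object feature from OFE feeds both guidance branches, and only the two guided features, \( f_{o-i-s} \) and \( f_{o-l} \), reach the final estimator.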
### Experimental Results
Experiments on the public Ohn-Bar dataset [33] and on TOI show that the model outperforms existing methods by a clear margin: it achieves a 23.1% improvement in Average Precision (AP) over the recently proposed Goal model, and its F1 score also improves.
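
For readers unfamiliar with the two metrics, the sketch below computes a common non-interpolated form of AP and a thresholded F1 score for binary important/unimportant labels. It is an assumption that this matches the paper's evaluation protocol, which is not described in this summary; the function names, the 0.5 threshold, and the toy scores are illustrative.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def f1_score(scores, labels, threshold=0.5):
    """F1 after marking every object with score >= threshold as important."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: four objects, two of which are truly important.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)
f1 = f1_score(scores, labels)
print(f"AP = {ap:.3f}, F1 = {f1:.3f}")  # AP = 0.833, F1 = 0.500
```

AP rewards ranking important objects above unimportant ones regardless of any threshold, while F1 depends on the chosen cut-off, which is why the two are usually reported together.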
### Summary
By constructing a large-scale dataset and proposing a multi-fold top-down guidance model, this paper addresses both the scarcity of data and the limitations of existing methods in on-road object importance estimation, supporting safer and smarter driving systems.