Abstract: Collaborative perception in autonomous driving significantly enhances the perception capabilities of individual agents. Immutable heterogeneity in collaborative perception, where agents have different and fixed perception networks, presents a major challenge due to the semantic gap in their exchanged intermediate features without modifying the perception networks. Most existing methods bridge the semantic gap through interpreters. However, they either require training a new interpreter for each new agent type, limiting extensibility, or rely on a two-stage interpretation via an intermediate standardized semantic space, causing cumulative semantic loss. To achieve both extensibility in immutable heterogeneous scenarios and low-loss feature interpretation, we propose PolyInter, a polymorphic feature interpreter. It contains an extension point through which emerging new agents can seamlessly integrate by overriding only their specific prompts, which are learnable parameters intended to guide the interpretation, while reusing PolyInter's remaining parameters. By leveraging polymorphism, our design ensures that a single interpreter is sufficient to accommodate diverse agents and interpret their features into the ego agent's semantic space. Experiments conducted on the OPV2V dataset demonstrate that PolyInter improves collaborative perception precision by up to 11.1% compared to SOTA interpreters, while comparable results can be achieved by training only 1.4% of PolyInter's parameters when adapting to new agents.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses the challenges of **immutable heterogeneous collaborative perception**. In autonomous driving, vehicles from different manufacturers or of different models run perception networks that are fixed and differ from one another. As a result, the intermediate features they exchange are heterogeneous and hard for other vehicles to interpret. The heterogeneity is not only semantic; it also covers differences in feature size and distribution.

Existing methods usually rely on interpreters to bridge this gap, but they have the following limitations:

1. **Poor scalability**: A new interpreter must be trained for each new agent type, which limits the scalability of the system.
2. **Cumulative semantic loss**: Two-stage interpretation (first converting features into a standard semantic space, then from that space into the target semantic space) accumulates semantic loss.

To address these limitations, the paper proposes a polymorphic feature interpreter named **PolyInter**. Its main contributions are:

- **High scalability**: New agent types are integrated by inheriting the trained interpreter and adjusting only the specific prompts associated with the new neighbor agent.
- **Low semantic loss**: A Channel Selection Module and a Spatial Attention Module keep semantic loss low during single-stage interpretation.

### How PolyInter works

PolyInter is trained in two stages:

1. **Base Model Training**: Existing agents jointly train the interpreter network, the general prompt, and the specific prompts for each neighbor agent. The interpreter network interprets the semantics of neighbor agents into the semantic space of the ego agent, and the prompts guide the feature interpretation.
2. **Generalization**: When a new neighbor agent is added, only the specific prompts for that agent are fine-tuned, while the interpreter network and all other parameters stay unchanged. The system can therefore adapt to new agent types without retraining the entire interpreter.

In this way, PolyInter achieves efficient immutable heterogeneous collaborative perception, significantly improves the accuracy of collaborative perception, and reduces the need for parameter fine-tuning.

### Experimental results

On the OPV2V dataset, PolyInter outperforms existing state-of-the-art immutable heterogeneous feature interpreters such as PnPDA and MPDA. Specifically, it improves the average precision (AP) of collaborative perception by 7.9% at IoU = 0.5 and by 11.1% at IoU = 0.7, and it needs to train only about 1.4% of its parameters when adapting to new agents.

In conclusion, by proposing PolyInter, the paper addresses the scalability and semantic loss problems in immutable heterogeneous collaborative perception and provides a more efficient and accurate solution for collaborative perception in autonomous driving.
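The two-stage scheme described under "How PolyInter works" can be illustrated with a small bookkeeping sketch. This is not the authors' implementation: the class names (`Param`, `ToyPolymorphicInterpreter`), the toy `interpret` arithmetic, and all parameter sizes are assumptions chosen only for illustration. It shows a shared network and general prompt being frozen while a new agent type contributes only its own trainable specific prompt.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Param:
    """A block of weights plus a flag saying whether it is still trained."""
    values: list[float]
    trainable: bool = True


@dataclass
class ToyPolymorphicInterpreter:
    """Hypothetical stand-in: shared network, general prompt, per-agent prompts."""
    network: Param
    general_prompt: Param
    specific_prompts: dict[str, Param] = field(default_factory=dict)

    def all_params(self) -> list[Param]:
        return [self.network, self.general_prompt, *self.specific_prompts.values()]

    def add_agent(self, agent_type: str, prompt_size: int) -> None:
        """Generalization stage: freeze everything, then add one trainable prompt."""
        for p in self.all_params():
            p.trainable = False
        self.specific_prompts[agent_type] = Param([0.0] * prompt_size)

    def interpret(self, agent_type: str, feature: list[float]) -> list[float]:
        """Toy interpretation: shift by both prompts, scale by one shared weight."""
        specific = self.specific_prompts[agent_type].values  # chosen by sender type
        general = self.general_prompt.values
        scale = self.network.values[0]
        return [scale * (f + g + s) for f, g, s in zip(feature, general, specific)]

    def trainable_fraction(self) -> float:
        total = sum(len(p.values) for p in self.all_params())
        trainable = sum(len(p.values) for p in self.all_params() if p.trainable)
        return trainable / total


# Base model: everything is trained jointly (sizes are arbitrary placeholders).
model = ToyPolymorphicInterpreter(
    network=Param([1.0] * 960),
    general_prompt=Param([0.0] * 16),
    specific_prompts={"agent_a": Param([0.0] * 16), "agent_b": Param([0.0] * 16)},
)

# A new agent type arrives: only its specific prompt remains trainable.
model.add_agent("agent_c", prompt_size=16)
print(f"trainable fraction: {model.trainable_fraction():.2%}")
```

The fraction printed here follows purely from the placeholder sizes and is not meant to reproduce the 1.4% figure reported in the paper. The point of the design is that `interpret` keeps a single code path for every sender, with behaviour varying only through the prompt selected by `agent_type`.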

One is Plenty: A Polymorphic Feature Interpreter for Immutable Heterogeneous Collaborative Perception
