Abstract: The increasing use of compact UAVs has created significant threats to public safety, while traditional drone detection systems are often bulky and costly. To address these challenges, we propose AV-DTEC, a lightweight self-supervised audio-visual fusion-based anti-UAV system. AV-DTEC is trained using self-supervised learning with labels generated by LiDAR, and it simultaneously learns audio and visual features through a parallel selective state-space model. With the learned features, a specially designed plug-and-play primary-auxiliary feature enhancement module integrates visual features into audio features for better robustness in cross-lighting conditions. To reduce reliance on auxiliary features and align modalities, we propose a teacher-student model that adaptively adjusts the weighting of visual features. AV-DTEC demonstrates exceptional accuracy and effectiveness in real-world multi-modality data. The code and trained models are publicly accessible on GitHub: https://github.com/AmazingDay1/AV-DETC.
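
The abstract describes integrating visual (auxiliary) features into audio (primary) features with an adaptively adjusted weight. As a rough illustration of that idea only, the sketch below adds a gated share of a visual feature vector to an audio feature vector. The function name `fuse_primary_auxiliary`, the scalar `gate_logit`, and the sigmoid gate are assumptions made for this example; the abstract does not specify how the module or the weighting is actually implemented.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_primary_auxiliary(audio_feat, visual_feat, gate_logit):
    """Add a gated share of the auxiliary (visual) features to the
    primary (audio) features.

    Hypothetical helper: the gate is a single scalar here, whereas a
    real module would likely learn it from the data.
    """
    if len(audio_feat) != len(visual_feat):
        raise ValueError("feature vectors must have the same length")
    weight = sigmoid(gate_logit)  # close to 0: rely mostly on audio
    return [a + weight * v for a, v in zip(audio_feat, visual_feat)]


audio = [0.2, -0.1, 0.5]
visual = [1.0, 1.0, 1.0]

# High gate: visual features are trusted (e.g. good lighting).
fused_bright = fuse_primary_auxiliary(audio, visual, gate_logit=4.0)
# Low gate: visual features are mostly ignored (e.g. darkness).
fused_dark = fuse_primary_auxiliary(audio, visual, gate_logit=-4.0)
```

With a low gate the fused vector stays close to the audio features, which is the behaviour one would want when the visual input is unreliable.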

What problem does this paper attempt to address?

The problem this paper addresses: compact unmanned aerial vehicles (UAVs) are now widely used and pose a significant threat to public safety, while traditional UAV detection systems are often bulky and costly. To address these challenges, the authors propose AV-DTEC, a lightweight self-supervised audio-visual fusion anti-UAV system. Specifically, the paper aims to solve the following problems:

1. **Limitations of traditional UAV detection systems**: Traditional methods rely on single-modality detectors (such as vision, audio, or LiDAR), each of which has inherent limitations. For example, visual detection is prone to failure under illumination changes, audio detection is easily disturbed by noise, and LiDAR has difficulty dealing with occlusion.
2. **Challenges of multi-modality fusion**: Some methods compensate for the shortcomings of a single modality by fusing several, but they usually require a large amount of labeled data, which is very difficult to obtain in practical applications.
3. **Improving the accuracy and robustness of UAV trajectory estimation and classification**: The paper proposes a new self-supervised audio-visual fusion method for more efficient and robust UAV trajectory estimation and classification, in particular one that maintains high precision under different illumination conditions.

### Solutions

To solve the above problems, the paper proposes the AV-DTEC system, whose main features include:

- **Self-supervised learning**: Labels generated by LiDAR are used for self-supervised training, avoiding the dependence on a large amount of manually labeled data.
- **Selective state-space model (SSM)**: Audio and visual features are learned simultaneously through a parallel selective state-space model.
- **Feature enhancement module**: A plug-and-play primary-auxiliary feature enhancement module integrates visual features into audio features to improve robustness under different illumination conditions.
- **Teacher-student model**: A teacher-student model adaptively adjusts the weights of visual features, reducing the dependence on auxiliary features and aligning the modalities.

### Summary

The AV-DTEC system performs UAV trajectory estimation and classification through self-supervised audio-visual fusion, and is especially suitable for resource-constrained scenarios such as the detection of and defense against small UAVs. According to the paper, the system improves detection accuracy and performs well under different illumination conditions.
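
The self-supervised training described above relies on labels generated by LiDAR rather than manual annotation. A minimal sketch of what such a pseudo-label could look like follows, assuming the LiDAR points belonging to the drone have already been isolated and that the label is simply their centroid. Both assumptions, and the names `lidar_pseudo_label` and `squared_error`, are hypothetical; the summary does not describe the actual labeling procedure or loss.

```python
def lidar_pseudo_label(points):
    """Return the centroid (x, y, z) of LiDAR points assumed to belong
    to the drone, or None if no points are available.

    Hypothetical helper: using the centroid as the position label is
    an assumption made for this example.
    """
    if not points:
        return None
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def squared_error(pred, target):
    """Squared Euclidean distance between a predicted and a target position."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


# Three LiDAR returns clustered around the drone (illustrative values).
cluster = [(1.0, 2.0, 3.0), (1.2, 2.2, 3.2), (0.8, 1.8, 2.8)]
label = lidar_pseudo_label(cluster)

# A position predicted from audio-visual features is scored against the
# LiDAR-derived label, so no human annotation is needed.
predicted_position = (1.0, 2.0, 3.5)
loss = squared_error(predicted_position, label)
```

The point of the sketch is only the data flow: one sensor supplies the training target, and the audio-visual prediction is penalized by its distance from that target.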

AV-DTEC: Self-Supervised Audio-Visual Fusion for Drone Trajectory Estimation and Classification

Adaptive Switching Spatial-Temporal Fusion Detection for Remote Flying Drones

A Small UAV Detection Method Based on Optical Flow and Visual Feature Fusion

Real-Time Drone Detection Using Deep Learning Approach

A Deep Learning Approach for Drone Detection and Classification Using Radar and Camera Sensor Fusion

Drone Detection and Tracking System Based on Fused Acoustical and Optical Approaches

AF-DETR: efficient UAV small object detector via Assemble-and-Fusion mechanism

Real-Time Multi-Modal Active Vision for Object Detection on UAVs Equipped With Limited Field of View LiDAR and Camera

Modality Meets Long-Term Tracker: A Siamese Dual Fusion Framework for Tracking UAV

AV-PedAware: Self-Supervised Audio-Visual Fusion for Dynamic Pedestrian Awareness

Deformable Convolution-Guided Multiscale Feature Learning and Fusion for UAV Object Detection

Learnable Cross-Scale Sparse Attention Guided Feature Fusion for UAV Object Detection

Long-Tailed 3D Detection via Multi-Modal Fusion

Aerial Monocular 3D Object Detection

OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

AMFEF-DETR: An End-to-End Adaptive Multi-Scale Feature Extraction and Fusion Object Detection Network Based on UAV Aerial Images

MMAUD: A Comprehensive Multi-Modal Anti-UAV Dataset for Modern Miniature Drone Threats

Towards Visible and Thermal Drone Monitoring with Convolutional Neural Networks

Lightweight UAV Object-Detection Method Based on Efficient Multidimensional Global Feature Adaptive Fusion and Knowledge Distillation

Enhancing UAV Detection in Surveillance Camera Videos through Spatiotemporal Information and Optical Flow

Delving into Robust Object Detection from Unmanned Aerial Vehicles: A Deep Nuisance Disentanglement Approach