Abstract: Bird sound is a crucial means of acoustic communication for birds, and research on its classification supports the protection, health, and diversity of ecosystems. Extracting multi-view features with several feature extraction methods provides more comprehensive information about bird sound and is a promising way to improve classification accuracy. However, efficiently fusing multi-view features to identify birds accurately remains challenging. To address this problem, this paper presents an efficient bird sound classification framework called MDF-Net. The approach extracts four acoustic features from bird sound audio, namely the wavelet transform spectrogram, the Hilbert-Huang transform spectrogram, the short-time Fourier transform spectrogram, and Mel-frequency cepstral coefficients, to describe the characteristics of bird sound from different views. A convolutional neural network then serves as an advanced feature extractor to obtain deep features of these spectrograms. Next, a multi-head self-attention mechanism models the correlation and importance of different features within each view to obtain essential and expressive feature representations, and a cross-attention mechanism aligns and correlates information across the four views, which makes it easier for the classifier to understand the relationships between features of different views. Finally, the outputs of this dual-attention mechanism are combined into a multi-view fusion feature that retains difference and diversity, which is then applied to bird sound classification. In this study, audio recordings from 16 bird species constitute the dataset. The multi-view fusion feature based on MDF-Net achieved a classification accuracy of 97.29%, outperforming the 9 single features and 3 fused features used in the experiments. The results demonstrate that the proposed MDF-Net successfully captures the feature relationships within each single view and between views, providing crucial information for correctly classifying bird sound samples. The approach efficiently fuses the features of different views and improves the performance of bird sound classification.
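
The dual-attention fusion described in the abstract can be illustrated with a minimal sketch. It is not the MDF-Net implementation: it assumes each view has already been reduced to a short list of equal-length feature vectors (random numbers stand in for CNN outputs here), it replaces multi-head attention with a single head without learned projections, and the pooling and concatenation used to build the fused vector are illustrative choices. The names `attention`, `mean_pool`, and `fuse_views` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors.

    Simplification: no learned query/key/value projections.
    """
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(width)])
    return out


def mean_pool(tokens):
    """Average a list of vectors into one vector."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


def fuse_views(views):
    """Build one fused vector from a dict of view name -> token vectors."""
    # Self-attention: relate the features inside each view.
    refined = {name: attention(t, t, t) for name, t in views.items()}
    fused = []
    for name, tokens in refined.items():
        # Cross-attention: this view queries the tokens of all other views.
        others = [tok for other, toks in refined.items()
                  if other != name for tok in toks]
        crossed = attention(tokens, others, others)
        # Assumed fusion rule: concatenate pooled self- and cross-attended features.
        fused.extend(mean_pool(tokens) + mean_pool(crossed))
    return fused


# Placeholder inputs: 4 views, 5 tokens per view, 8 dimensions per token.
random.seed(0)
views = {
    name: [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
    for name in ("wavelet", "hht", "stft", "mfcc")
}
fused = fuse_views(views)
print(len(fused))
```

In this sketch the fused vector holds one pooled self-attended part and one pooled cross-attended part per view; a classifier would take that vector as input.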

A Multi-level Attention Fusion Network for Weakly Supervised Audio Classification

Multi-level Attention Model for Weakly Supervised Audio Classification

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Deep Mutual Attention Network for Acoustic Scene Classification

Multi-Attention Audio-Visual Fusion Network for Audio Spatialization

AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

MDF-Net: A Multi-View Dual-Attention Fusion Network for Efficient Bird Sound Classification

MAF-Net: Multidimensional Attention Fusion Network for Multichannel Speech Separation

A Deep Neural Network for Audio Classification with a Classifier Attention Mechanism

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Weakly Labelled AudioSet Tagging With Attention Neural Networks

Audio Classification Using Attention-Augmented Convolutional Neural Network

Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition

Audio-Based Music Classification with DenseNet And Data Augmentation

Audio Deepfake Detection with Self-Supervised WavLM and Multi-Fusion Attentive Classifier

A research for sound event localization and detection based on local–global adaptive fusion and temporal importance network

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

Multi-stream Network With Temporal Attention For Environmental Sound Classification

Continuous Emotion Recognition with Audio-visual Leader-follower Attentive Fusion