Abstract: Sound Event Detection (SED) must identify the sound events in a recording and detect their onset and offset times. The former requires features with both long- and short-term dependencies to detect sound events of different durations, while the latter requires fine-grained dependencies. Although our previously proposed Multi-Scale Fully Convolutional Networks (MS-FCN) uses cascaded dilated convolutions to model temporal context and takes multi-scale information into account, two shortcomings remain: neighboring information and fine-grained dependencies are ignored, and intermediate-length temporal dependencies are neglected. The first shortcoming is caused by the element-skipping sampling mechanism of dilated convolution. To overcome it, this paper proposes the dilated mixed convolution module, which mixes dilated and standard convolutions to capture both fine-grained and long-term dependencies and to give weight to neighboring information. The second shortcoming arises because the temporal dependency length grows too quickly in the cascaded dilated convolution module, so too much intermediate temporal information is ignored. To address it, this paper proposes the Dilated Temporal Pyramid Pooling (DTPP) module, in which parallel dilated convolutions with multiple dilation factors capture the intermediate temporal information at an appropriate temporal dependency length. Since the cascaded module has been shown to model temporal context effectively and efficiently in MS-FCN, and the DTPP module can capture the temporal information that the cascaded module ignores, this paper combines the advantages of both in a cascaded parallel module that captures richer temporal dependencies. Based on this module, Multi-Scale Feature Fusion Networks (MSFF-Net) is proposed, which achieves competitive performance on three open datasets.
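The two shortcomings described in the abstract can be illustrated with simple frame arithmetic. The sketch below is not the authors' implementation: the function names, the kernel size of 3, and all dilation factors are hypothetical values chosen for illustration, and an odd kernel size is assumed. It lists which frame offsets a dilated kernel samples, how adding a standard kernel restores the neighboring frames, and how the receptive field of a cascaded stack compares with that of parallel branches.

```python
def tap_offsets(kernel_size, dilation):
    """Frame offsets (relative to the centre frame) sampled by one 1-D dilated kernel.

    Assumes an odd kernel_size (illustrative choice, not taken from the paper).
    """
    half = kernel_size // 2
    return [i * dilation for i in range(-half, half + 1)]


def mixed_offsets(kernel_size, dilation):
    """Union of the taps of a dilated kernel and a standard (dilation 1) kernel."""
    dilated = set(tap_offsets(kernel_size, dilation))
    standard = set(tap_offsets(kernel_size, 1))
    return sorted(dilated | standard)


def cascaded_receptive_fields(kernel_size, dilations):
    """Receptive field in frames after each layer of a cascaded (stacked) design."""
    receptive_field, per_layer = 1, []
    for d in dilations:
        receptive_field += (kernel_size - 1) * d
        per_layer.append(receptive_field)
    return per_layer


def parallel_receptive_fields(kernel_size, dilations):
    """Receptive field in frames of each branch when branches run side by side."""
    return [1 + (kernel_size - 1) * d for d in dilations]


if __name__ == "__main__":
    # A dilated kernel skips the frames next to the centre ...
    print(tap_offsets(3, 4))
    # ... while the mixed variant also covers the immediate neighbours.
    print(mixed_offsets(3, 4))
    # Doubling dilations make the cascaded receptive field jump quickly ...
    print(cascaded_receptive_fields(3, [1, 2, 4, 8]))
    # ... and parallel branches can be chosen to land between those jumps.
    print(parallel_receptive_fields(3, [3, 5, 6]))
```

With these hypothetical settings, the cascaded stack reaches receptive fields of 3, 7, 15, and 31 frames, while the parallel branches cover 7, 11, and 13 frames, lengths that the cascade passes over between its third and fourth layers.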

MSFF-Net: Multi-scale Feature Fusing Networks with Dilated Mixed Convolution and Cascaded Parallel Framework for Sound Event Detection.

Multi-Scale and Single-Scale Fully Convolutional Networks for Sound Event Detection

MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection

MULTI-SCALE CONVOLUTION BASED ATTENTION NETWORK FOR SEMI-SUPERVISED SOUND EVENT DETECTION Technical Report

Improved Self-Consistency Training with Selective Feature Fusion for Sound Event Detection

Attention mechanism combined with residual recurrent neural network for sound event detection and localization

Multi-dimensional frequency dynamic convolution with confident mean teacher for sound event detection

Sound Event Detection Using Multi-Scale Dense Convolutional Recurrent Neural Network with Lightweight Attention

MFFNet: Multi-modal Feature Fusion Network for V-D-T Salient Object Detection

Multi-Scale Recurrent Neural Network for Sound Event Detection

A Joint Detection-Classification Model for Weakly Supervised Sound Event Detection Using Multi-Scale Attention Method

Hierarchical-Concatenate Fusion TDNN for sound event classification

Multi-frame Concatenation for Detection of Rare Sound Events Based on Deep Neural Network

Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention

A Multi-grained based Attention Network for Semi-supervised Sound Event Detection

Pushing the Limit of Sound Event Detection with Multi-Dilated Frequency Dynamic Convolution

Multi-Scale Feature Fusion Transformer Network for End-to-End Single Channel Speech Separation

Infrasound Event Classification Fusion Model Based on Multiscale SE-CNN and BiLSTM

Decoupling Temporal Convolutional Networks Model in Sound Event Detection and Localization

MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection