Abstract: Sound Event Detection (SED) must identify the sound events in a recording and detect their onset and offset times. The former requires features with both long-term and short-term dependencies to detect sound events of different durations, and the latter requires fine-grained dependencies. Although our previously proposed Multi-Scale Fully Convolutional Networks (MS-FCN) use cascaded dilated convolutions to model temporal context and take multi-scale information into account, two shortcomings remain: neighboring information and fine-grained dependencies are ignored, and intermediate-length temporal dependencies are neglected. The first shortcoming is caused by the element-skipping sampling mechanism of dilated convolution. To overcome it, this paper proposes the dilated mixed convolution module, which mixes dilated and standard convolutions to capture both fine-grained and long-term dependencies and to give weight to neighboring information. The second shortcoming arises because the temporal dependency length grows too quickly in the cascaded dilated convolution module, so too much intermediate temporal information is ignored. To address it, this paper proposes the Dilated Temporal Pyramid Pooling (DTPP) module, in which parallel dilated convolutions with multiple dilation factors capture the intermediate temporal information at an appropriate temporal dependency length. Because the cascaded module has been shown to model temporal context effectively and efficiently in MS-FCN, and the DTPP module can capture the temporal information that the cascaded module ignores, this paper combines the advantages of both in a cascaded parallel module that captures richer temporal dependencies. Based on this module, this paper proposes Multi-Scale Feature Fusion Networks (MSFF-Net), which obtain competitive performance on three open datasets.
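
The two shortcomings described in the abstract can be illustrated with simple receptive-field arithmetic. The sketch below is not the authors' implementation: the kernel size of 3, the dilation factors, and all function names are assumptions chosen for illustration, and "mixing" is modelled simply as the union of the frame positions sampled by a dilated and a standard convolution.

```python
def tap_offsets(kernel_size, dilation):
    """Relative frame offsets sampled by a 1-D convolution with an odd kernel."""
    half = kernel_size // 2
    return [k * dilation for k in range(-half, half + 1)]


def cascaded_receptive_fields(kernel_size, dilations):
    """Receptive field (in frames) after each layer of a stack of dilated convolutions."""
    fields = []
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each stacked layer widens the field
        fields.append(rf)
    return fields


def parallel_receptive_fields(kernel_size, dilations):
    """Receptive field (in frames) of each branch when the convolutions run in parallel."""
    return [1 + (kernel_size - 1) * d for d in dilations]


# Shortcoming 1: a dilated kernel skips the immediate neighbours of the centre frame.
dilated_taps = tap_offsets(3, 4)                                  # assumed dilation 4
mixed_taps = sorted(set(dilated_taps) | set(tap_offsets(3, 1)))  # add a standard conv

# Shortcoming 2: with doubling dilations (assumed), the cascaded field grows quickly
# and skips intermediate lengths that parallel branches can cover.
cascaded = cascaded_receptive_fields(3, [1, 2, 4, 8])
parallel = parallel_receptive_fields(3, [1, 2, 3, 4, 5, 6])

print("dilated taps:", dilated_taps)
print("mixed taps:", mixed_taps)
print("cascaded receptive fields:", cascaded)
print("parallel receptive fields:", parallel)
```

Under these assumed settings, the dilated kernel alone never samples the frames adjacent to the centre, while the mixed set does; and the cascaded stack jumps between widely spaced temporal lengths, whereas the parallel branches fill in lengths between them.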

NAS-DYMC: NAS-Based Dynamic Multi-Scale Convolutional Neural Network for Sound Event Detection

Sound Event Detection Using Multi-Scale Dense Convolutional Recurrent Neural Network with Lightweight Attention

Multi-Scale Convolutional Recurrent Neural Network with Ensemble Method for Weakly Labeled Sound Event Detection

Multi-Scale Convolution Based Attention Network for Semi-Supervised Sound Event Detection (Technical Report)

Attention mechanism combined with residual recurrent neural network for sound event detection and localization

Multi-Scale Recurrent Neural Network for Sound Event Detection

Multi-Scale and Single-Scale Fully Convolutional Networks for Sound Event Detection

Non-Negative Matrix Factorization-Convolutional Neural Network (NMF-CNN) For Sound Event Detection

MTF-CRNN: Multiscale Time-Frequency Convolutional Recurrent Neural Network for Sound Event Detection

Multi-scale Convolutional Recurrent Neural Network and Data Augmentation for Polyphonic Sound Event Detection

R-CRNN: Region-based Convolutional Recurrent Neural Network for Audio Event Detection

Convolutional Recurrent Neural Networks with Multi-Sized Convolution Filters for Sound-Event Recognition

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

An Ensemble Stacked Convolutional Neural Network Model for Environmental Event Sound Recognition

Acoustic Scene Classification Based on Dense Convolutional Networks Incorporating Multi-channel Features

Robust Sound Event Recognition Using Convolutional Neural Networks

Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features

Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network

Cascaded Contextual Region-based Convolutional Neural Network for Event Detection from Time Series Signals: A Seismic Application

Sound Event Localization and Detection Using Element-Wise Attention Gate and Asymmetric Convolutional Recurrent Neural Networks

MSFF-Net: Multi-scale Feature Fusing Networks with Dilated Mixed Convolution and Cascaded Parallel Framework for Sound Event Detection