MSFF-Net: Multi-scale Feature Fusing Networks with Dilated Mixed Convolution and Cascaded Parallel Framework for Sound Event Detection.

Yingbin Wang, Guanghui Zhao, Kai Xiong, Guangming Shi
DOI: https://doi.org/10.1016/j.dsp.2021.103319
IF: 2.92
2021-01-01
Digital Signal Processing
Abstract: Sound Event Detection (SED) must identify the sound events in a recording and detect their onset and offset times. The former requires features with both long- and short-term dependencies so that events of different durations can be detected, while the latter requires fine-grained dependencies. Our previously proposed Multi-Scale Fully Convolutional Networks (MS-FCN) use cascaded dilated convolutions to model temporal context and take multi-scale information into account, but two shortcomings remain: neighboring information and fine-grained dependencies are ignored, and intermediate-length temporal dependencies are neglected. The first shortcoming stems from the element-skipping sampling mechanism of dilated convolution. To overcome it, this paper proposes the dilated mixed convolution module, which mixes dilated and standard convolutions to capture both fine-grained and long-term dependencies and to give weight to neighboring information. The second shortcoming arises because the temporal dependency length grows too quickly in the cascaded dilated convolution module, so much of the intermediate temporal information is ignored. To address it, this paper proposes the Dilated Temporal Pyramid Pooling (DTPP) module, in which parallel dilated convolutions with multiple dilation factors capture the intermediate temporal information at suitable temporal dependency lengths. Because the cascaded module has been shown in MS-FCN to model temporal context validly and efficiently, and the DTPP module captures the temporal information that the cascaded module ignores, this paper combines the two in a cascaded parallel module that captures richer temporal dependencies. Building on this module, Multi-Scale Feature Fusion Networks (MSFF-Net) are proposed, which achieve competitive performance on three open datasets.
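The two shortcomings named in the abstract can be illustrated with simple receptive-field arithmetic. The sketch below is not the authors' implementation. It assumes 1-D convolutions with kernel size 3 and stride 1, a cascade whose dilation doubles at each layer, and a set of parallel dilation factors chosen only for illustration; the abstract specifies none of these values. The helper names (dilated_taps, mixed_taps, receptive_field) are hypothetical, and mixed_taps only shows which input frames become visible when a standard convolution is mixed in, not how the paper combines the two outputs.

```python
def dilated_taps(kernel_size, dilation):
    """Relative input offsets read by one output of a 1-D dilated convolution."""
    half = kernel_size // 2
    return [dilation * k for k in range(-half, half + 1)]


def mixed_taps(kernel_size, dilation):
    """Offsets seen when a standard convolution (dilation 1) is mixed with a
    dilated one. Only the covered positions are modelled here; how the two
    outputs are combined is not stated in the abstract."""
    taps = set(dilated_taps(kernel_size, dilation))
    taps.update(dilated_taps(kernel_size, 1))
    return sorted(taps)


def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


KERNEL = 3                       # assumed kernel size
CASCADE = [1, 2, 4, 8, 16]       # assumed dilations, doubling per layer
PARALLEL = [2, 5, 10, 20]        # hypothetical parallel dilation factors

# Shortcoming 1: with dilation 4 the immediate neighbours (-1, +1) are skipped.
skipped = dilated_taps(KERNEL, 4)        # [-4, 0, 4]
restored = mixed_taps(KERNEL, 4)         # [-4, -1, 0, 1, 4]

# Shortcoming 2: the cascade's receptive field roughly doubles per layer,
# leaving gaps between consecutive temporal dependency lengths.
cascade_rfs = [receptive_field(KERNEL, CASCADE[: i + 1])
               for i in range(len(CASCADE))]         # [3, 7, 15, 31, 63]

# Parallel branches each apply one dilated convolution to the same input,
# so their receptive fields can be placed inside those gaps.
parallel_rfs = [receptive_field(KERNEL, [d]) for d in PARALLEL]  # [5, 11, 21, 41]

print("dilated taps:", skipped)
print("mixed taps:", restored)
print("cascade receptive fields:", cascade_rfs)
print("parallel receptive fields:", parallel_rfs)
```

Under these assumed settings, each parallel branch covers a temporal length that falls between two consecutive cascade lengths, which is the kind of intermediate dependency the abstract says the cascade alone misses.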