Abstract: In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features, while introducing a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results show that models trained on EchoSet generalize better to data collected in the physical world than models trained on other datasets, which validates the practical value of EchoSet. On EchoSet and real-world data, TIGER reduces the number of parameters by 94.3% and the MACs by 95.3% while surpassing the state-of-the-art (SOTA) model TF-GridNet in performance. This is the first speech separation model with fewer than 1 million parameters that achieves performance comparable to the SOTA model.
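
The band-splitting idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration, not the paper's configuration: the band plan (narrow bands at low frequencies, wide bands at high frequencies), the 257-bin spectrogram, the feature size, and the random projection that stands in for a learned per-band layer. The function names `split_bands` and `compress_bands` are hypothetical.

```python
import numpy as np

def split_bands(n_freq_bins, band_plan):
    """Return contiguous (start, stop) bin ranges covering all frequency bins.

    band_plan is a list of (width_in_bins, count) pairs, applied from low to
    high frequency. Bins left over at the top form one final band.
    """
    edges, start = [], 0
    for width, count in band_plan:
        for _ in range(count):
            if start >= n_freq_bins:
                break
            stop = min(start + width, n_freq_bins)
            edges.append((start, stop))
            start = stop
    if start < n_freq_bins:
        edges.append((start, n_freq_bins))
    return edges

def compress_bands(spec, edges, rng, dim=8):
    """Map each band of a (freq, time, 2) real/imag spectrogram to `dim` features.

    A random projection is used as a stand-in for a learned layer (assumption).
    Returns an array of shape (n_bands, time, dim).
    """
    n_frames = spec.shape[1]
    out = []
    for lo, hi in edges:
        width = hi - lo
        # (width, time, 2) -> (time, width * 2): one vector per frame and band
        flat = spec[lo:hi].transpose(1, 0, 2).reshape(n_frames, width * 2)
        proj = rng.standard_normal((width * 2, dim)) / np.sqrt(width * 2)
        out.append(flat @ proj)
    return np.stack(out)

# Hypothetical plan: 32 bands of 2 bins, 12 bands of 8 bins, 3 bands of 32 bins.
rng = np.random.default_rng(0)
edges = split_bands(257, [(2, 32), (8, 12), (32, 3)])
spec = rng.standard_normal((257, 10, 2))
features = compress_bands(spec, edges, rng)
```

In this sketch the 257 frequency bins collapse to 48 bands, so any later processing along the frequency axis works on a much shorter sequence. That is the general reason band splitting lowers computational cost; the actual band widths and compression layers used by TIGER are not given in this summary.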

What problem does this paper attempt to address?

This paper attempts to solve two main problems:

### 1. Improve the computational efficiency and reduce the size of speech separation models

Although existing speech separation models achieve strong performance, they often require a large number of parameters and substantial computational resources, which makes them hard to deploy in practical scenarios with low-latency requirements and limited compute. Specifically:

- **High computational complexity**: Many existing models (such as TF-GridNet) rely on bidirectional LSTMs and self-attention mechanisms. They perform very well, but at a high computational cost.
- **Many model parameters**: These models usually contain a large number of parameters, which makes them difficult to apply in real-time processing scenarios such as edge devices.

To solve these problems, the paper proposes a new model named TIGER (Time-frequency Interleaved Gain Extraction and Reconstruction network). TIGER improves computational efficiency and reduces model parameters in the following ways:

- **Frequency band segmentation strategy**: Uses prior knowledge to segment frequency bands and compress frequency information.
- **Multi-scale selective attention module (MSA)**: Extracts contextual features.
- **Full-frequency-frame attention module (F3A)**: Captures contextual information along both the time and frequency dimensions.

The experimental results show that on the EchoSet dataset, TIGER reduces the number of parameters by 94.3% and MACs by 95.3%, while outperforming the existing SOTA model TF-GridNet.

### 2. Provide a speech separation dataset closer to the real world

Existing speech separation datasets (such as WSJ0-2mix, WHAM!, and Libri2Mix) differ considerably from practical application scenarios, mainly in the following respects:

- **Lack of noise and reverberation**: Many datasets contain only clean audio and do not account for noise and reverberation.
- **Fixed overlap ratio**: The voices of different speakers overlap completely or at a fixed ratio, which does not match real conversations.
- **Single acoustic environment**: They do not fully account for the influence of factors such as room shape and material properties on reverberation.

To solve these problems, the paper introduces a new dataset, EchoSet. Its characteristics include:

- **Diverse background noise**: It contains multiple types of background noise.
- **Realistic reverberation**: It considers room shape and material properties to generate more realistic reverberation effects.
- **Random overlap ratio**: The voices of two speakers overlap at a random ratio.

The paper shows that models trained on EchoSet generalize better to real-world data, which supports the effectiveness and practicality of EchoSet.

### Summary

This paper addresses the low computational efficiency and excessive parameter counts of existing speech separation models, as well as the large gap between existing datasets and real-world scenarios, by proposing the TIGER model and the EchoSet dataset. The experimental results show that TIGER not only significantly reduces computational cost and the number of parameters, but also performs well in complex acoustic environments, giving it practical application value.
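
The random overlap ratio described for EchoSet can be illustrated with a short NumPy sketch. This is not the dataset's actual generation pipeline, which is not described in this summary: noise and reverberation are left out, the overlap ratio is assumed to be drawn uniformly from [0, 1] and measured relative to the shorter utterance, and the function name `mix_with_random_overlap` is hypothetical.

```python
import numpy as np

def mix_with_random_overlap(s1, s2, rng):
    """Overlay two utterances so that a random fraction of the shorter one overlaps.

    Returns (mixture, padded_s1, padded_s2, overlap_in_samples).
    Assumption: the ratio is uniform in [0, 1]; s2 always starts after s1 begins.
    """
    ratio = rng.uniform(0.0, 1.0)
    overlap = int(round(ratio * min(len(s1), len(s2))))
    offset = len(s1) - overlap       # sample index where the second speaker starts
    total = offset + len(s2)         # always >= len(s1) because overlap <= len(s2)

    padded_s1 = np.zeros(total)
    padded_s1[:len(s1)] = s1
    padded_s2 = np.zeros(total)
    padded_s2[offset:offset + len(s2)] = s2
    return padded_s1 + padded_s2, padded_s1, padded_s2, overlap

# Toy signals of all ones, so overlapping samples are easy to count.
rng = np.random.default_rng(0)
mixture, src1, src2, overlap = mix_with_random_overlap(np.ones(100), np.ones(60), rng)
```

The zero-padded sources have the same length as the mixture, so they can serve directly as separation targets. A fixed-overlap dataset corresponds to always using the same `ratio`, which is the limitation of earlier datasets noted above.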

TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation