Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone Environments

Jihyun Kim, Stijn Kindt, Nilesh Madhu, Hong-Goo Kang
2024-06-14
Abstract: Ad-hoc distributed microphone environments, where microphone locations and numbers are unpredictable, present a challenge to traditional deep learning models, which typically require fixed architectures. To tailor deep learning models to accommodate arbitrary array configurations, the Transform-Average-Concatenate (TAC) layer was previously introduced. In this work, we integrate TAC layers with dual-path transformers for speech separation from two simultaneous talkers in realistic settings. However, the distributed nature makes it hard to fuse information across microphones efficiently. Therefore, we explore the efficacy of blindly clustering microphones around sources of interest prior to enhancement. Experimental results show that this deep cluster-informed approach significantly improves the system's capacity to cope with the inherent variability observed in ad-hoc distributed microphone environments.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses speech separation in ad hoc distributed microphone environments. Traditional deep-learning models usually require a fixed architecture, whereas the number and positions of microphones in such environments are unpredictable. The paper proposes combining blind clustering with deep-learning networks to improve the separation of simultaneous speakers in these environments.

### Main Problems

1. **Challenges in Ad Hoc Distributed Microphone Environments**:
   - The number and positions of microphones are unpredictable.
   - Devices may enter, leave, or move as the environment changes.
   - Time delays, sampling rate offsets (SROs), and sample time offsets (STOs) differ between microphones.
   - Microphone characteristics (such as frequency response and directivity) may differ.
2. **Limitations of Existing Methods**:
   - Traditional deep-learning models have difficulty dealing with unknown microphone-array geometries.
   - Classical signal-processing methods are effective, but the resulting audio quality is poor, and simple beamformers cannot fully eliminate interference.

### Solutions

1. **Blind-Clustering Techniques**:
   - Use a blind-clustering method based on spatial statistics to cluster microphones around active speakers.
   - Estimate a pseudo-reference microphone for each cluster, chosen so that the target speech is more prominent in it than in any other microphone of that cluster.
2. **Deep-Learning Networks**:
   - Combine the Transform-Average-Concatenate (TAC) layer and the Dual-Path Transformer (DPTNet) to process multi-channel time-domain signals within each cluster.
   - Use the reference microphone to select the hidden features that best represent the target speech for the masking operation.
3. **Training-Data Generation**:
   - Propose a method for simulating clustered data, which saves training time and keeps training independent of any specific clustering algorithm.

### Experimental Verification

- **Data Sets**:
  - The training data are built from the clean speech of the WSJ0-2mix data set, convolved with shoebox room impulse responses (RIRs) generated by gpuRIR.
  - The evaluation data use the SINS data set to simulate the multi-microphone distribution of real environments.
- **Experimental Setup**:
  - The encoder and decoder use a kernel size of 8 samples and a stride of 50%.
  - The split size is set to 250, the feature dimension is 64, and the number of attention heads is 4.
  - The training loss is the scale-invariant signal-to-distortion ratio (SI-SDR), the optimizer is Adam, and the initial learning rate is 0.125.
- **Experimental Results**:
  - The deep-learning method combined with clustering information is significantly superior to classical methods in terms of SI-SDR, PESQ, and STOI.
  - Using the reference microphone as the input to a single-channel model gives a significant performance improvement, especially in PESQ.
  - The proposed method makes good use of the spatial diversity of the microphones within a cluster, which supports the effectiveness of the clustered training data.

### Conclusion

The paper proposes a method that combines blind clustering with deep-learning networks to address speech separation in ad hoc distributed microphone environments. The experimental results show that this method is significantly superior to classical methods on multiple objective metrics, demonstrating the advantage of combining traditional signal processing with modern deep learning in practical speech-processing tasks.
Future work could explore incorporating cross-cluster information into the network design.
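
### Illustrative Sketch

Two building blocks named in the summary above, the TAC layer and the SI-SDR training objective, are compact enough to sketch in a few lines. The NumPy code below is a simplified illustration under stated assumptions, not the authors' implementation: the names `TACSketch` and `si_sdr` are hypothetical, the weights are random rather than trained, ReLU stands in for learned activations, and the hidden size of 32 is arbitrary (only the feature dimension of 64 comes from the setup above). It is meant to show why a TAC-style layer accepts any number of microphones in any order, and why SI-SDR ignores the scale of the estimate.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


class TACSketch:
    """Simplified transform-average-concatenate layer (random, untrained weights)."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # The same three weight matrices are shared by every microphone channel.
        self.w_transform = rng.normal(0.0, 0.1, (feat_dim, hidden_dim))
        self.w_average = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.w_concat = rng.normal(0.0, 0.1, (2 * hidden_dim, feat_dim))

    def __call__(self, x):
        # x has shape (channels, frames, feat_dim); channels may vary per call.
        h = relu(x @ self.w_transform)               # transform each channel
        avg = relu(h.mean(axis=0) @ self.w_average)  # average over channels
        avg = np.broadcast_to(avg, h.shape)          # copy the summary to each channel
        fused = np.concatenate([h, avg], axis=-1)    # concatenate local + global
        # Residual connection (an assumption of this sketch).
        return x + relu(fused @ self.w_concat)


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; the remainder counts as distortion.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    distortion = estimate - projection
    return 10.0 * np.log10(
        (np.sum(projection ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer = TACSketch(feat_dim=64, hidden_dim=32)
    # The same layer handles clusters with 3 and with 7 microphones.
    for n_mics in (3, 7):
        out = layer(rng.normal(size=(n_mics, 50, 64)))
        print(n_mics, out.shape)

    target = rng.normal(size=8000)
    estimate = target + 0.1 * rng.normal(size=8000)
    # Rescaling the estimate leaves the score (almost exactly) unchanged.
    print(si_sdr(estimate, target), si_sdr(5.0 * estimate, target))
```

Because the only cross-channel operation is a mean, reordering the microphones simply reorders the outputs, and adding or removing microphones needs no change to the weights. When SI-SDR is used as a training loss, it is typically negated so that minimizing the loss maximizes the ratio.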