Abstract: Ad-hoc distributed microphone environments, where microphone locations and numbers are unpredictable, present a challenge to traditional deep learning models, which typically require fixed architectures. To tailor deep learning models to accommodate arbitrary array configurations, the Transform-Average-Concatenate (TAC) layer was previously introduced. In this work, we integrate TAC layers with dual-path transformers for speech separation from two simultaneous talkers in realistic settings. However, the distributed nature makes it hard to fuse information across microphones efficiently. Therefore, we explore the efficacy of blindly clustering microphones around sources of interest prior to enhancement. Experimental results show that this deep cluster-informed approach significantly improves the system's capacity to cope with the inherent variability observed in ad-hoc distributed microphone environments.
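
The transform-average-concatenate idea named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes plain Python lists instead of tensors, ReLU activations, a residual connection, and random weights, and the names `tac_layer`, `matvec`, `rand_matrix`, `w_transform`, `w_average`, and `w_concat` are hypothetical. The point it shows is that the same weights apply to any number of microphones in any order.

```python
import random


def matvec(vec, weights):
    """Multiply a feature vector by a weight matrix stored as a list of rows."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weights)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def tac_layer(mics, w_transform, w_average, w_concat):
    """Apply one TAC-style step to a list of per-microphone feature vectors.

    mics: list of M feature vectors, each of length F (M may vary per call).
    Returns M updated feature vectors of length F.
    """
    # Transform: the same weights are applied to every microphone.
    hidden = [relu(matvec(m, w_transform)) for m in mics]
    # Average: pool over microphones, so the result ignores count and order.
    n_mics = len(hidden)
    mean = [sum(col) / n_mics for col in zip(*hidden)]
    pooled = relu(matvec(mean, w_average))
    # Concatenate: each microphone sees its own features plus the pooled ones.
    out = []
    for m, h in zip(mics, hidden):
        update = relu(matvec(h + pooled, w_concat))
        out.append([a + b for a, b in zip(m, update)])  # residual connection
    return out


def rand_matrix(rows, cols, rng):
    """Random weights or features, for illustration only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    n_feat, n_hidden = 4, 6  # assumed sizes, chosen only for the demo
    w_t = rand_matrix(n_feat, n_hidden, rng)
    w_a = rand_matrix(n_hidden, n_hidden, rng)
    w_c = rand_matrix(2 * n_hidden, n_feat, rng)
    for n_mics in (2, 5):  # the same weights handle different array sizes
        feats = rand_matrix(n_mics, n_feat, rng)
        print(n_mics, "mics ->", len(tac_layer(feats, w_t, w_a, w_c)), "outputs")
```

Because the only cross-microphone operation is a mean, reordering the input microphones simply reorders the outputs, which is what makes this kind of layer suitable for arrays of unknown size and geometry.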

What problem does this paper attempt to address?

This paper addresses speech separation in ad-hoc distributed microphone environments. Traditional deep-learning models usually require a fixed architecture, whereas the number and positions of microphones in such environments are unpredictable. The paper proposes to improve the separation of simultaneous speakers in these environments by combining blind clustering with deep-learning networks.

### Main Problems

1. **Challenges in Ad-Hoc Distributed Microphone Environments**:
   - The number and positions of microphones are unpredictable.
   - Devices may enter, leave, or move as the environment changes.
   - Time delays, sampling rate offsets (SROs), and sample time offsets (STOs) differ between microphones.
   - Microphone characteristics (such as frequency response and directivity) may differ.
2. **Limitations of Existing Methods**:
   - Traditional deep-learning models have difficulty dealing with unknown microphone-array geometries.
   - Classical signal-processing methods are effective, but the resulting audio quality is poor, and simple beamformers cannot fully eliminate interference.

### Solutions

1. **Blind Clustering**:
   - Use a blind clustering method based on spatial statistics to cluster microphones around active speakers.
   - Estimate a pseudo-reference microphone for each cluster, i.e., the one in which the target speech is most prominent among all microphones in that cluster.
2. **Deep-Learning Networks**:
   - Combine the Transform-Average-Concatenate (TAC) layer and the Dual-Path Transformer (DPTNet) to process multi-channel time-domain signals within each cluster.
   - Use the information from the reference microphone to select the hidden features that best represent the target speech for the masking operation.
3. **Training-Data Generation**:
   - Propose a method for simulating clustered data, which saves training time and keeps training independent of any specific clustering algorithm.

### Experimental Verification

- **Datasets**:
  - The training data is built from the WSJ0-2mix clean-speech dataset, convolved with shoebox room impulse responses (RIRs) generated by gpuRIR.
  - The evaluation data uses the SINS dataset to simulate the multi-microphone distribution of real environments.
- **Experimental Setup**:
  - The encoder and decoder use a kernel size of 8 samples and a stride of 50%.
  - The split size is set to 250, the feature dimension is 64, and the number of attention heads is 4.
  - The training loss is the scale-invariant signal-to-distortion ratio (SI-SDR), the optimizer is Adam, and the initial learning rate is 0.125.
- **Experimental Results**:
  - The deep-learning method combined with clustering information is significantly superior to classical methods in terms of SI-SDR, PESQ, and STOI.
  - Using the reference microphone as the input to a single-channel model yields a significant performance improvement, especially in PESQ.
  - The proposed method makes good use of the spatial diversity of the microphones within a cluster, which verifies the effectiveness of the clustered training data.

### Conclusion

The paper proposes a method that combines blind clustering with deep-learning networks for speech separation in ad-hoc distributed microphone environments. The experimental results show that this method is significantly superior to classical methods on multiple objective metrics, demonstrating the advantage of combining traditional signal-processing techniques with modern deep learning in practical speech-processing tasks. Future work can further explore incorporating cross-cluster information into the network design.
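
The SI-SDR measure named as the training loss and as an evaluation metric can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it uses plain Python lists, removes the mean of both signals before projection, and adds a small `eps` for numerical stability; the function name `si_sdr` and the demo signals are hypothetical. When SI-SDR is used as a loss, the value is usually negated so that minimizing the loss maximizes the ratio.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB.

    estimate, target: equal-length sequences of samples.
    """
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    # Remove the mean so a constant offset does not affect the result.
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to find the optimal scaling.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    alpha = dot / (target_energy + eps)
    scaled_target = [alpha * b for b in t]
    # Everything in the estimate that is not the scaled target is distortion.
    signal_energy = sum(s * s for s in scaled_target)
    distortion_energy = sum((a - s) ** 2 for a, s in zip(e, scaled_target))
    return 10.0 * math.log10((signal_energy + eps) / (distortion_energy + eps))


if __name__ == "__main__":
    # Synthetic demo signals (illustrative only): a sinusoid plus interference.
    target = [math.sin(0.1 * i) for i in range(1000)]
    interference = [0.1 * math.cos(0.37 * i) for i in range(1000)]
    estimate = [t + v for t, v in zip(target, interference)]
    louder = [3.0 * x for x in estimate]
    print("SI-SDR of estimate:       ", si_sdr(estimate, target))
    print("SI-SDR of rescaled copy:  ", si_sdr(louder, target))
```

Rescaling the estimate leaves the value essentially unchanged, which is the property that makes the measure suitable when the separated output has an arbitrary gain.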

Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone Environments

Efficient, Cluster-Informed, Deep Speech Separation with Cross-Cluster Information in AD-HOC Wireless Acoustic Sensor Networks

Exploiting Speaker Embeddings for Improved Microphone Clustering and Speech Separation in ad-hoc Microphone Arrays

Audio–Visual Deep Clustering for Speech Separation

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation

Single-Channel Multi-Speaker Separation using Deep Clustering

Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation

Distributed speech separation in spatially unconstrained microphone arrays

Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks

Low-Latency Deep Clustering For Speech Separation

Multi-Head Self-Attention-Based Deep Clustering for Single-Channel Speech Separation

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

A Unified DNN Approach to Speaker-Dependent Simultaneous Speech Enhancement and Speech Separation in Low SNR Environments

Combining Spatial Clustering with LSTM Speech Models for Multichannel Speech Enhancement

Deep Clustering and Conventional Networks for Music Separation: Stronger Together

Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones

A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition

Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations

Speaker Recognition Based on Pre-Trained Model and Deep Clustering