Efficient, Cluster-Informed, Deep Speech Separation with Cross-Cluster Information in AD-HOC Wireless Acoustic Sensor Networks

Nilesh Madhu, Jihyun Kim, Hong-Goo Kang, Stijn Kindt
DOI: https://doi.org/10.1109/IWAENC61483.2024.10694357
2024-09-09
Abstract: Environments with ad hoc distributed microphones, characterised by unpredictable locations and varying numbers, pose a significant challenge to most conventional deep-learning-based speech enhancement and separation models. Transform-Average-Concatenate (TAC) layers in combination with dual-path transformer (DPT) models have been proposed for speech separation with flexible array configurations. For widely distributed microphone setups, we previously showed that blindly clustering microphones around target sound sources and processing each cluster separately yields good separation. In this work, we propose to further improve output signal quality by exploiting inter-cluster information through a suitable exchange between clusters, using cross-attention transformers in the DPTs. Additionally, we introduce an efficient TAC that lets us increase the temporal resolution and performance while keeping the computational complexity in check. Experiments in realistically simulated scenarios show separation quality improved by 1.1 dB in SI-SDR, 0.02 in STOI and 0.2 in PESQ, with significance at the median level.
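To illustrate why TAC layers suit flexible array configurations, the following is a minimal sketch of a generic transform-average-concatenate step, not the efficient variant proposed in the paper (the abstract does not describe its internals). The function names, layer sizes, random weights, and the use of ReLU are all assumptions made for illustration. The point it shows is that the same weights handle any number of microphones, and that reordering the microphones only reorders the outputs.

```python
import random


def linear(x, w):
    """Multiply feature vector x by weight matrix w (a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x):
    return [max(0.0, v) for v in x]


def tac(channels, w_transform, w_average, w_concat):
    """Generic TAC over a list of per-microphone feature vectors (hypothetical sketch)."""
    # Transform: the same weights are applied to every microphone channel.
    transformed = [relu(linear(x, w_transform)) for x in channels]
    # Average: mean across channels, followed by a shared layer.
    n = len(transformed)
    mean = [sum(col) / n for col in zip(*transformed)]
    averaged = relu(linear(mean, w_average))
    # Concatenate: join each channel's features with the shared average,
    # project back to the input size, and add a residual connection.
    return [
        [xi + yi for xi, yi in zip(x, relu(linear(t + averaged, w_concat)))]
        for x, t in zip(channels, transformed)
    ]


def random_matrix(n_out, n_in, rng):
    """Illustrative stand-in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
FEAT, HIDDEN = 4, 6  # assumed sizes, chosen only for the example
W_T = random_matrix(HIDDEN, FEAT, rng)
W_A = random_matrix(HIDDEN, HIDDEN, rng)
W_C = random_matrix(FEAT, 2 * HIDDEN, rng)

# Three microphones, each with a FEAT-dimensional feature vector.
mics = [[rng.uniform(-1.0, 1.0) for _ in range(FEAT)] for _ in range(3)]
out = tac(mics, W_T, W_A, W_C)
```

Calling `tac` with five channels instead of three needs no change to the weights, since the only cross-channel operation is the mean.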
Engineering, Computer Science