Resource-Efficient Separation Transformer

Luca Della Libera, Cem Subakan, Mirco Ravanelli, Samuele Cornell, Frédéric Lepoutre, François Grondin
2024-01-16
Abstract: Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require many learnable parameters. This paper explores Transformer-based speech separation at a reduced computational cost. Our main contribution is the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.
Audio and Speech Processing, Machine Learning, Sound, Signal Processing
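
The two ideas named in the abstract, non-overlapping blocks in the latent space and one compact summary per chunk, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the zero-padding of the last chunk, the chunk size of 150, and the use of mean pooling as the summary are all assumptions made for illustration.

```python
import numpy as np

def chunk_non_overlapping(latent, chunk_size):
    """Split a (time, features) array into (num_chunks, chunk_size, features).

    The tail is zero-padded so the length divides evenly (an assumption;
    other padding schemes are possible).
    """
    time_steps, features = latent.shape
    pad = (-time_steps) % chunk_size
    padded = np.pad(latent, ((0, pad), (0, 0)))
    return padded.reshape(-1, chunk_size, features)

def summarize_chunks(chunks):
    """Reduce each chunk to a single vector.

    Mean pooling over the time axis is assumed here; note that zero-padding
    slightly biases the summary of the last chunk.
    """
    return chunks.mean(axis=1)

# Hypothetical latent representation: 1000 frames, 64 features.
rng = np.random.default_rng(0)
latent = rng.standard_normal((1000, 64))

chunks = chunk_non_overlapping(latent, chunk_size=150)
summaries = summarize_chunks(chunks)

print(chunks.shape)     # (7, 150, 64)
print(summaries.shape)  # (7, 64)
```

In this toy setting, any processing across chunks would see a sequence of 7 summary vectors instead of 1000 frames, which gives an intuition for why such a design could scale better on long mixtures.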