TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

Yueyuan Sui, Minghui Zhao, Junxi Xia, Xiaofan Jiang, Stephen Xia
2024-05-29
Abstract: We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state-of-the-art models with memory footprints of hundreds of MBs and methods better suited for resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train TRAMBA with audio speech datasets that are widely available. Then, users fine-tune with a small amount of bone conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speedup of up to 465 times. We integrate TRAMBA into real systems and show that TRAMBA (i) improves battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher quality voice in noisy environments than over-the-air speech; (iii) requires a memory footprint of less than 20.0 MB.
Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses efficient, low-resource acoustic and bone conduction speech enhancement and super-resolution on mobile and wearable devices. Specifically, it targets the following challenges:

1. **Data Collection Difficulty**: Collecting bone conduction speech data is time-consuming, so data is scarce, which limits the development and application of related technologies.
2. **Performance Gap**: Existing high-performance models (such as GANs) are effective but have large parameter counts and high memory usage, making them unsuitable for resource-constrained mobile and wearable devices. Conversely, the small models that do fit on these devices perform poorly.
3. **Data Scarcity**: Training bone conduction speech super-resolution models requires paired data from bone conduction sensors (bone conduction microphones, BCM, or accelerometers) and over-the-air (OTA) microphones, but such paired data is hard to obtain.
4. **System Optimization**: In practical applications, the choice of sampling rate and of where computation runs significantly impacts inference time and battery life.

To address these issues, the paper proposes TRAMBA, a hybrid model combining Transformer and Mamba architectures, aimed at improving bone conduction speech super-resolution and enhancement while maintaining low memory usage and fast inference speed. The main contributions of TRAMBA include:

- On standard quality and intelligibility metrics, TRAMBA outperforms existing state-of-the-art GAN models by up to 7.3% in PESQ and 1.8% in STOI, with a model size of only 5.2 million parameters.
- TRAMBA can adapt to various acoustic modalities, including over-the-air microphones and bone/vibration modalities (BCM and accelerometers). By fine-tuning with only 15 minutes of user data, it can significantly improve performance across different sensor positions.
- Integrating TRAMBA into wearable and mobile platforms achieves real-time speech super-resolution and reduces the power consumption of sampling and data transmission by over 50%.
- User studies demonstrate that vibration-based sensing modalities significantly outperform over-the-air microphone systems with noise suppression algorithms in noisy environments.

In summary, TRAMBA aims to provide an efficient, low-resource solution that enables the widespread application of bone conduction speech enhancement and super-resolution technologies on mobile and wearable devices.
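As a rough sanity check on the two size figures quoted above (5.2 million parameters and a memory footprint of less than 20.0 MB), the following sketch estimates the storage needed for the weights alone. It is an illustration under stated assumptions, not the authors' accounting: the text here does not state the weight precision, so 32-bit floats are assumed, and the helper name `weight_footprint_mib` and the `BYTES_PER_PARAM` table are hypothetical.

```python
# Assumption: bytes per stored weight for a few common precisions.
# The source does not say which precision TRAMBA uses.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_footprint_mib(num_params: int, dtype: str = "float32") -> float:
    """Return the storage for the weights alone, in MiB (2**20 bytes).

    Activations, buffers, and runtime overhead are deliberately ignored.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


if __name__ == "__main__":
    params = 5_200_000  # parameter count quoted in the summary
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_footprint_mib(params, dtype):.1f} MiB")
```

Under the float32 assumption this gives roughly 19.8 MiB, which matches the "less than 20.0 MB" claim if MB is read as 2^20 bytes. Read as decimal megabytes (10^6 bytes), the same weights would occupy 20.8 MB, so the exact interpretation depends on conventions not specified here.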