Abstract: We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state-of-the-art models with memory footprints of hundreds of MBs and methods better suited for resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train TRAMBA with audio speech datasets that are widely available. Then, users fine-tune with a small amount of bone conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speedup of up to 465 times. We integrate TRAMBA into real systems and show that TRAMBA (i) improves battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher quality voice in noisy environments than over-the-air speech; (iii) requires a memory footprint of less than 20.0 MB.

What problem does this paper attempt to address?

The paper aims to achieve efficient, low-resource acoustic and bone conduction speech enhancement and super-resolution on mobile and wearable devices. Specifically, it targets the following challenges:

1. **Data Collection Difficulty**: Collecting bone conduction speech data is very time-consuming, so data is scarce, which limits the development and application of related technologies.
2. **Performance Gap**: Existing high-performance models (such as GANs) are effective but have large parameter counts and high memory usage, making them unsuitable for resource-constrained mobile and wearable devices. Conversely, models small enough for these devices perform poorly.
3. **Data Scarcity**: Training bone conduction speech super-resolution models requires paired data from bone conduction sensors (BCM or accelerometers) and over-the-air (OTA) microphones, but such paired data is hard to obtain.
4. **System Optimization**: In practical applications, the choice of sampling rate and computation location significantly impacts inference time and battery life.

To address these issues, the paper proposes TRAMBA, a hybrid model combining Transformer and Mamba architectures, aimed at improving bone conduction speech super-resolution and enhancement while maintaining low memory usage and fast inference speed.

The main contributions of TRAMBA include:

- On standard quality and intelligibility metrics, TRAMBA outperforms existing state-of-the-art GAN models (by up to 7.3% in PESQ and 1.8% in STOI, per the abstract), with a model size of only 5.2 million parameters.
- TRAMBA can adapt to various acoustic modalities, including over-the-air microphones and bone/vibration modalities (BCM and accelerometers). Fine-tuning with only 15 minutes of user data significantly improves performance across different sensor positions.
- Integrating TRAMBA into wearable and mobile platforms achieves real-time speech super-resolution and reduces the power consumption of sampling and data transmission by over 50%.
- User studies demonstrate that vibration-based sensing modalities significantly outperform over-the-air microphone systems with noise suppression algorithms in noisy environments.

In summary, TRAMBA aims to provide an efficient, low-resource solution that enables the widespread application of bone conduction speech enhancement and super-resolution technologies on mobile and wearable devices.
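As a rough sanity check on the figures quoted above, the short Python sketch below relates the 5.2 million parameter count to the stated footprint of less than 20.0 MB, and relates a reduction in power draw to a gain in battery life. It is only back-of-envelope arithmetic under assumptions that the text does not state: weights are stored as 32-bit floats, only the weights are counted (no activations or runtime overhead), "MB" is read as MiB, and battery capacity is fixed. The function names are hypothetical and are not taken from the paper.

```python
def param_memory_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory needed for the weights alone, in MiB (2**20 bytes).

    Assumes 4 bytes per parameter (float32) unless told otherwise.
    """
    return num_params * bytes_per_param / 2**20


def battery_life_gain(power_reduction: float) -> float:
    """Fractional battery-life gain when average power draw drops.

    power_reduction is a fraction in [0, 1); battery capacity is
    assumed constant, so lifetime scales as 1 / power.
    """
    if not 0.0 <= power_reduction < 1.0:
        raise ValueError("power_reduction must be in [0, 1)")
    return 1.0 / (1.0 - power_reduction) - 1.0


# 5.2 million parameters, as quoted in the summary.
fp32_mib = param_memory_mib(5_200_000)      # float32 weights
fp16_mib = param_memory_mib(5_200_000, 2)   # hypothetical float16 weights

print(f"float32 weights: {fp32_mib:.1f} MiB")
print(f"float16 weights: {fp16_mib:.1f} MiB")

# Halving total power draw would double battery life (a 100% gain).
print(f"gain at 50% lower power: {battery_life_gain(0.5):.0%}")
```

Under these assumptions the float32 weights come to just under 20 MiB, which is consistent with the stated footprint. The battery function applies to total device power; the "over 50%" figure in the summary covers only sampling and data transmission, so the two numbers are not directly comparable.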

TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

A Wearable Vision-To-Audio Sensory Substitution Device for Blind Assistance and the Correlated Neural Substrates

Enabling Real-Time On-Chip Audio Super Resolution for Bone-Conduction Microphones

Selective State Space Model for Monaural Speech Enhancement

Mamba in Speech: Towards an Alternative to Self-Attention

Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis

A MVDR-MWF Combined Algorithm for Binaural Hearing Aid System

In-Ear-Voice: Towards Milli-Watt Audio Enhancement With Bone-Conduction Microphones for In-Ear Sensing Platforms

Fusing Bone-Conduction and Air-Conduction Sensors for Complex-Domain Speech Enhancement

An Investigation of Incorporating Mamba for Speech Enhancement

Conformer-Based Speech Recognition On Extreme Edge-Computing Devices

A Supervised Speech Enhancement Method for Smartphone-Based Binaural Hearing Aids

A Smart Binaural Hearing Aid Architecture Based on a Mobile Computing Platform

A Smart Binaural Hearing Aid Architecture Leveraging a Smartphone APP with Deep-Learning Speech Enhancement

SimulTron: On-Device Simultaneous Speech to Speech Translation

Doppler Radar-Based Human Speech Recognition Using Mobile Vision Transformer

PoinTramba: A Hybrid Transformer-Mamba Framework for Point Cloud Analysis

Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement

MC-SEMamba: A Simple Multi-channel Extension of SEMamba

Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer

Speech-T: Transducer for Text to Speech and Beyond