IBACodec: End-to-end speech codec with intra-inter broad attention
Xiaonan Yang, Jinjie Zhou, Deshan Yang, Yunwei Wan, Limin Pan, Senlin Luo
DOI: https://doi.org/10.1016/j.ipm.2024.103979
IF: 7.466
2024-12-05
Information Processing & Management
Abstract: Speech compression attempts to yield a compact bitstream that can represent a speech signal with minimal distortion by eliminating redundant information, which is increasingly challenging as the bitrate decreases. However, existing neural speech codecs do not fully exploit the information from previous speech sequences, and learning encoded features blindly leads to the ineffective removal of redundant information, resulting in suboptimal reconstruction quality. In this work, we propose an end-to-end speech codec with intra-inter broad attention, named IBACodec, that efficiently compresses speech across different types of datasets, including LibriTTS, LJSpeech, and others. By designing an intra-inter broad transformer that integrates multi-head attention networks and LSTM, our model captures broad attention with direct context awareness between the intra- and inter-frames of speech. Furthermore, we present a dual-branch conformer for channel-wise modeling to effectively eliminate redundant information. In subjective evaluations using speech at a 24 kHz sampling rate, IBACodec at 6.3 kbps is comparable to SoundStream at 9 kbps and better than Opus at 9 kbps, while using about 30% fewer bits. Objective experimental results show that IBACodec outperforms state-of-the-art codecs across a wide range of bitrates, with average improvements in ViSQOL, LLR, and CEP of up to 4.97%, 38.94%, and 25.39%, respectively.
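The "about 30% fewer bits" figure follows from the two bitrates quoted in the abstract (6.3 kbps against 9 kbps). The short sketch below only reproduces that arithmetic; the function name bitrate_saving is a hypothetical helper introduced here for illustration and is not part of the paper or of any codec library.

```python
def bitrate_saving(codec_kbps: float, reference_kbps: float) -> float:
    """Return the fraction of bits saved by a codec relative to a reference bitrate.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if reference_kbps <= 0:
        raise ValueError("reference bitrate must be positive")
    return (reference_kbps - codec_kbps) / reference_kbps


# Bitrates quoted in the abstract: IBACodec at 6.3 kbps, SoundStream/Opus at 9 kbps.
saving = bitrate_saving(6.3, 9.0)
print(f"{saving:.0%}")  # prints 30%
```

The saving is relative to the 9 kbps reference, so (9 - 6.3) / 9 = 0.30.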
Computer Science, Information Systems; Information Science & Library Science