Improving Speech Translation by Understanding the Speech From Latent Code

Hao Zhang, Nianwen Si, Wenlin Zhang, Xukui Yang, Dan Qu
DOI: https://doi.org/10.1109/lsp.2024.3393353
2024-05-07
IEEE Signal Processing Letters
Abstract: Due to data scarcity and modal complexity, the semantic representations extracted by the encoder of end-to-end speech translation (E2E-ST) are often flawed, and its decoder will further produce incorrect semantic alignment between the source speech and the target text based on them, which ultimately impairs translation performance. In contrast to previous research, which focused on how to extract better semantic representations, we focus on how to assist the decoder in performing the translation process in the presence of flawed semantic representations. Specifically, we propose a variational speech translation (VST) framework that leverages latent code containing sentence-level semantic information to aid the decoder in accurately aligning the source speech and target text semantically. By leveraging latent code, VST can compensate for flawed frame-level semantic representations from the encoder and aid the decoder in generating accurate translation text. Our experimental results show that VST can be seamlessly integrated with the current state-of-the-art method, achieving substantial performance improvements. Further analysis and visualization demonstrate that the learned latent code indeed contains rich semantic information and can effectively rectify misalignments in the decoder.
engineering, electrical & electronic