Abstract: End-to-end (E2E) multi-speaker speech recognition with the serialized output training (SOT) strategy performs well across diverse multi-speaker scenarios. However, the E2E architecture does not explicitly model regions of overlapping speech, which may limit the model's ability to generalize. To address this issue, we introduce two approaches: an overlap-aware encoding method and a monotonic attention loss. The former lets the model acquire knowledge about overlapping speech through multitask learning, while the latter encourages the model to learn the attention patterns associated with overlap by constraining the attention of adjacent text time steps. Experimental results on the AliMeeting dataset show that combining the two methods improves the model's performance.
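
To make the two ideas concrete, the following is a minimal sketch of how such training terms could be written. It is an illustration under stated assumptions, not the formulation used in this work: the hinge penalty on the attention centre, the frame-level binary cross-entropy for overlap labels, and the names `monotonic_attention_penalty`, `overlap_bce`, `total_loss`, `w_overlap`, and `w_mono` are all hypothetical. The sketch also ignores that under SOT the attention may legitimately jump back when the output switches to the next speaker.

```python
import math


def expected_positions(attn):
    """Attention-weighted mean encoder frame index for each text step.

    attn[t][j] is the (assumed, already normalized) attention weight that
    text step t places on encoder frame j.
    """
    return [sum(j * w for j, w in enumerate(row)) for row in attn]


def monotonic_attention_penalty(attn):
    """Hypothetical hinge penalty on backward moves of the attention centre
    between adjacent text steps (zero when the centre never moves back)."""
    centres = expected_positions(attn)
    if len(centres) < 2:
        return 0.0
    backward = [max(0.0, prev - cur) for prev, cur in zip(centres, centres[1:])]
    return sum(backward) / len(backward)


def overlap_bce(probs, labels, eps=1e-7):
    """Frame-level binary cross-entropy for an assumed auxiliary task that
    predicts whether each encoder frame contains overlapping speech."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def total_loss(asr_loss, overlap_loss, mono_loss, w_overlap=0.1, w_mono=0.1):
    """Multitask objective; the weights are placeholder values, not tuned."""
    return asr_loss + w_overlap * overlap_loss + w_mono * mono_loss
```

In this sketch, an attention matrix whose centre moves forward at every text step incurs no penalty, while a step whose centre falls behind the previous one is penalized in proportion to the distance it moved back.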

Improving Multi-Speaker ASR With Overlap-Aware Encoding And Monotonic Attention