Direct Speech Translation for Automatic Subtitling

Sara Papi, Marco Gaido, Alina Karakanta, Mauro Cettolo, Matteo Negri, Marco Turchi
2023-07-26
Abstract: Automatic subtitling is the task of automatically translating the speech of audiovisual content into short pieces of timed text, i.e. subtitles and their corresponding timestamps. The generated subtitles need to conform to space and time requirements, while being synchronised with the speech and segmented in a way that facilitates comprehension. Given its considerable complexity, the task has so far been addressed through a pipeline of components that separately deal with transcribing, translating, and segmenting text into subtitles, as well as predicting timestamps. In this paper, we propose the first direct ST model for automatic subtitling that generates subtitles in the target language along with their timestamps with a single model. Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition, also being competitive with production tools on both in-domain and newly-released out-domain benchmarks covering new scenarios.
Computation and Language
What problem does this paper attempt to address?
The paper aims to address the problem of automatic subtitle generation, specifically translating speech from audio-visual content directly into subtitles in the target language and generating corresponding display timestamps. Traditional methods usually adopt a pipeline approach, handling Automatic Speech Recognition (ASR), timestamp extraction, subtitle segmentation, and Machine Translation (MT) separately. This paper proposes a single-model solution based on Direct Speech Translation (ST), which can simultaneously accomplish translation and timestamp generation tasks.
The authors believe that existing automatic subtitle generation technologies have the following shortcomings:
1. Multi-step processes lead to error accumulation.
2. Lack of a unified model to handle both translation and timestamp generation simultaneously.
3. Existing benchmark datasets (such as MuST-Cinema) are limited to single-speaker scenarios without background noise, making it impossible to comprehensively evaluate the performance of automatic subtitle systems.
To this end, the authors propose a brand-new automatic subtitle system with the following features:
- Uses a single direct ST model to generate target language subtitles and their timestamps.
- Introduces two new benchmark test sets (en→de and en→es), covering new domains such as news/documentaries and interviews, and including background noise and multiple speakers to better evaluate the system's real-world performance.
Experimental results show that the proposed direct ST model outperforms pipeline systems across multiple language pairs and demonstrates performance comparable to or even better than existing production tools in both in-domain and out-of-domain benchmark tests.
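To make the idea of timed subtitles that "conform to space and time requirements" concrete, here is a minimal Python sketch. It is not the authors' model or evaluation code. It assumes the output is written in the common SRT file format, and the limits used (42 characters per line, 2 lines per subtitle, 21 characters per second) are illustrative defaults chosen for this example, not values taken from the text above. The names `Subtitle`, `srt_time`, `to_srt`, and `violations` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtitle:
    start: float      # display start time, in seconds
    end: float        # display end time, in seconds
    lines: List[str]  # text lines shown on screen together


def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(subs: List[Subtitle]) -> str:
    """Serialise subtitles as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, sub in enumerate(subs, start=1):
        header = f"{index}\n{srt_time(sub.start)} --> {srt_time(sub.end)}\n"
        blocks.append(header + "\n".join(sub.lines))
    return "\n\n".join(blocks) + "\n"


def violations(sub: Subtitle, max_cpl: int = 42, max_lines: int = 2,
               max_cps: float = 21.0) -> List[str]:
    """Return the names of the (assumed) constraints that a subtitle breaks."""
    found = []
    if len(sub.lines) > max_lines:
        found.append("too_many_lines")   # space: too many lines on screen
    if any(len(line) > max_cpl for line in sub.lines):
        found.append("line_too_long")    # space: characters per line
    duration = sub.end - sub.start
    if duration <= 0:
        found.append("bad_timing")       # end must come after start
    elif sum(len(line) for line in sub.lines) / duration > max_cps:
        found.append("reading_speed")    # time: characters per second
    return found
```

A check like `violations` could be applied to the output of either a cascade or a direct system, since it only looks at the final text and timestamps.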