Abstract: In this presentation, we report experiments on developing statistical machine translation (SMT) systems of practical use for the professional translation of subtitles. We present results of several methods that were tested for this task, describing both positive and negative outcomes. We believe these results to be of interest to companies considering the integration of SMT into multilingual commercial systems, and to researchers interested in the use of current methods for large-scale SMT system development in a specific domain. The work we describe is part of the SUMAT project, funded through the EU ICT Policy Support Programme (2011–2014), whose goal is to produce machine translation systems for film and TV subtitles for seven language pairs. Nine partners are involved in the project: four subtitle companies (Deluxe Digital Studios, InVision, Titelbild, Voice & Script International) and five technical partners (Athens Technology Center, CapitaTI, TextShuttle, University of Maribor and Vicomtech). In order to integrate SMT systems into a commercially viable translation workflow, it is vital for such systems to meet quality levels that do not hinder the post-editing experience. Previous experiments (Bywood et al., 2012) have shown that, even in cases of increased productivity for professional translators post-editing machine-translated output, the perception and use of the systems are negatively affected overall by output of poor quality. To overcome this issue and raise SMT quality, we explored several approaches, taking into account issues of training and decoding efficiency, as well as issues regarding the integration of data from different sources and domains. The baseline phrase-based SMT systems were trained on large numbers of translated subtitles provided by the subtitling companies (between 200,000 and 2 million subtitles per language pair), using the Moses framework (Koehn et al., 2007).
To improve the baselines, two sets of experiments were performed: incorporating linguistic information, including factored models in various configurations (Koehn and Hoang, 2007), syntax-based statistical translation and decompounding; and developing larger models by combining in-domain and out-of-domain data via mixture-modeling and perplexity minimization techniques (Sennrich, 2012). Overall, the first approach provided little to no improvement over the baselines, whereas the second one proved successful at a comparatively lower cost. In this talk, we will describe the main experiments and their results, offering insight into the optimal balance between development costs and the requirement for better system accuracy in professional applications.

Sima’an, K., Forcada, M.L., Grasmick, D., Depraetere, H., Way, A. (eds.) Proceedings of the XIV Machine Translation Summit (Nice, September 2–6, 2013), pp. 369–370. © 2013 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.

Subtitles to Segmentation: Improving Low-Resource Speech-to-Text Translation Pipelines

Segmenting Subtitles for Correcting ASR Segmentation Errors

Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora

Lightweight Audio Segmentation for Long-form Speech Translation

Character-aware audio-visual subtitling in context

Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition

SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations

SMT Approaches for Commercial Translation of Subtitles

Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Improved Long-Form Spoken Language Translation with Large Language Models

SBAAM! Eliminating Transcript Dependency in Automatic Subtitling

Direct Speech Translation for Automatic Subtitling

Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR

Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation

Evaluating Subtitle Segmentation for End-to-end Generation Systems

TAMS: Translation-Assisted Morphological Segmentation

Controlling Utterance Length in NMT-based Word Segmentation with Attention

Finding Better Subword Segmentation for Neural Machine Translation

Lexically Grounded Subword Segmentation

Long-Form End-to-End Speech Translation via Latent Alignment Segmentation

Learning Adaptive Segmentation Policy for Simultaneous Translation