Abstract: With the increasing demand for access to content in foreign languages in recent years, we have also seen a steady improvement in the quality of tools that can help bridge this gap. One such tool is Statistical Machine Translation (SMT), which learns automatically from real examples of human translations, without the need for manual intervention. Training such a system takes just a few days, sometimes even hours, but requires a large number of sentences aligned to their corresponding translations, a resource known as a bi-text. Such bi-texts contain translations of written texts, as they are typically derived from newswire, administrative, technical and legislative documents, e.g., from the EU and UN. However, with the widespread use of mobile phones and online conversation programs such as Skype, as well as personal assistants such as Siri, there is a growing need for spoken language recognition, understanding, and translation. Unfortunately, most bi-texts are not very useful for training a spoken language SMT system, as the language they cover is written, which differs from speech in style, formality, vocabulary choice, length of utterances, etc. It turns out that there exists a growing community-generated source of spoken language translations, namely movie subtitles. These come in plain text in a common format in order to facilitate rendering the text segments accordingly. The dark side of subtitles is that they are usually created for pirated copies of copyright-protected movies. Yet, their use in research is an exploitation of a “positive side effect” of Internet movie piracy, which allows for easy creation of spoken bi-texts in a number of languages. Aligning subtitles into bi-texts typically relies on a key property of movie subtitles, namely the temporal indexing of subtitle segments, along with other features.
Due to the nature of movies, subtitles differ from other resources in several aspects: they are mostly transcriptions of movie dialogues that are often spontaneous speech, which contains a lot of slang, idiomatic expressions, and also fragmented spoken utterances, with repetitions, errors and corrections, rather than grammatical sentences; thus, this material is commonly summarised in the subtitles, rather than being literally transcribed. Since subtitles are user-generated, the translations are free, incomplete and dense (due to summarisation and compression) and, therefore, reveal cultural differences. Degrees of rephrasing and compression vary across languages and also depend on subtitling traditions. Moreover, subtitles are created to be displayed in parallel to a movie in order to be linked to the movie's actual sound signal. Subtitles also arbitrarily include some meta information such as the movie title, year of release, genre, subtitle author/translator details and trailers. They may also contain visual translation, e.g., into a sign language. Certain versions of subtitles are especially compiled for the hearing-impaired to include extra information about non-spoken sounds that are either primary, e.g., coughing, or secondary background noises, e.g., soundtrack music, street noise, etc. This brings yet another challenge to the alignment process: the complex mappings caused by many deletions and insertions. Furthermore, subtitles must be short enough to fit the screen in a readable manner and are only shown for a short time period, which presents a new constraint to the alignment of different languages with different visual and linguistic features. The languages in which subtitles are available differ from one movie to another.
Arabic, even though it is spoken by more than 420 million people and is the 5th most spoken language worldwide, has a relatively scarce online presence. For example, according to Wikipedia's statistics of article counts, Arabic is ranked 23rd. Yet, Web traffic analytics show that search queries for Arabic subtitles and traffic from the Arabic region are among the highest. This increase in demand for Arabic content is not surprising given the recent dramatic economic and socio-political shift in the Arab World. On another note, Arabic, as a Semitic language, has a complex morphology, which requires special handling when mapping it to another language and therefore poses a challenge for machine translation.
In this work, we look at movie subtitles as a unique source of bi-texts in an attempt to align as many translations of movies as possible in order to improve English to Arabic SMT. Translating from English into Arabic is an underexplored translation direction and, due to the morphological richness of Arabic along with other factors, yields significantly lower results than translating in the opposite direction (Arabic to English). For our experiments, we collected pairs of English-Arabic subtitles for more than 29,000 movies/TV shows, a collection larger than any preexisting subtitle data set. We designed a sequence of heuristics to eliminate the inherent noise that comes with the subtitles' source in order to yield good quality alignment. We aligned the subtitles by time overlap, using the time information provided within the subtitle files to measure how much the segments in the two languages overlap. This alignment approach is language-independent and outperforms other traditional approaches such as the length-based approach, which relies on segment boundaries to match translation segments, as segment boundaries differ from one language to another, e.g., because of the need to fit the text on the screen. Our goal was to maximise the number of aligned sentence pairs while minimising the alignment errors.
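To make the idea of time-overlap alignment concrete, the following is a minimal sketch and not the procedure used in this work. It assumes SRT-style timestamps of the form HH:MM:SS,mmm, and the names `Segment`, `parse_timestamp`, `overlap_ratio`, and `align_by_time_overlap`, the definition of the overlap ratio, and the 0.5 threshold are all hypothetical choices made for illustration. The noise-filtering heuristics mentioned above are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # display start time, in seconds
    end: float    # display end time, in seconds
    text: str


def parse_timestamp(stamp: str) -> float:
    """Convert an SRT-style 'HH:MM:SS,mmm' timestamp to seconds."""
    hms, millis = stamp.strip().split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000.0


def overlap_ratio(a: Segment, b: Segment) -> float:
    """Shared display time divided by the total span of both segments."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    span = max(a.end, b.end) - min(a.start, b.start)
    return max(0.0, shared) / span if span > 0 else 0.0


def align_by_time_overlap(source, target, min_ratio=0.5):
    """Pair each source segment with its best-overlapping target segment.

    Assumption: a pair is kept only if the overlap ratio reaches
    min_ratio (an illustrative threshold, not a value from this work).
    """
    pairs = []
    for src in source:
        best = max(target, key=lambda tgt: overlap_ratio(src, tgt), default=None)
        if best is not None and overlap_ratio(src, best) >= min_ratio:
            pairs.append((src.text, best.text))
    return pairs


# Toy example: two English segments and three Arabic segments, the last
# of which is a translator credit with no English counterpart.
english = [
    Segment(parse_timestamp("00:00:01,000"), parse_timestamp("00:00:03,000"), "Hello."),
    Segment(parse_timestamp("00:00:04,000"), parse_timestamp("00:00:06,500"), "Where are you going?"),
]
arabic = [
    Segment(1.2, 3.1, "مرحبا."),
    Segment(4.1, 6.4, "إلى أين تذهب؟"),
    Segment(10.0, 12.0, "ترجمة"),
]

pairs = align_by_time_overlap(english, arabic)
for en_text, ar_text in pairs:
    print(en_text, "|", ar_text)
```

Because the matching uses only timestamps, the sketch is language-independent, and the unmatched credit segment is simply left out of the resulting pairs. A real pipeline would also need to handle one-to-many mappings and timing drift between subtitle files, which this sketch ignores.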
We evaluated our models relative to one another and also extrinsically, i.e., by measuring the quality of an SMT system that used this bi-text for training. We automatically evaluated our SMT systems using BLEU, a standard measure for machine translation evaluation. We also implemented an in-house Web application tool in order to crowd-source human judgments comparing the SMT baseline's output and our best-performing system's output. Our experiments yielded bi-texts of varied size and relative quality, which we used to train an SMT system. Adding any of our bi-texts improved the baseline SMT system, which was trained on TED talks from the IWSLT 2013 competition. Ultimately, our best SMT system outperformed the baseline by about two BLEU points, which is a very significant improvement, clearly visible to humans; this was confirmed in manual evaluation. We hope that the resulting subtitles corpus, the largest collected so far (about 82 million words), will facilitate research in spoken language SMT.

Automatically Annotate TV Series Subtitles for Dialogue Corpus Construction

Automatic Construction of Discourse Corpora for Dialogue Translation

Character-aware audio-visual subtitling in context

A Manually Annotated Chinese Corpus for Non-task-oriented Dialogue Systems

Dialog Act Annotation for Chinese Daily Conversation

Building Context-Related Dialogue Systems Based on Chinese-Script-Dialogue Corpus

Subtitles to Segmentation: Improving Low-Resource Speech-to-Text Translation Pipelines

Improving Abstractive Dialogue Summarization with Speaker-Aware Supervised Contrastive Learning.

Creating Speech-to-Speech Corpus from Dubbed Series

Detect Turn-takings in Subtitle Streams with Semantic Recall Transformer Encoder

End-to-End Subtitle Detection and Recognition for Videos in East Asian Languages via CNN Ensemble with Near-Human-Level Performance

Autocorrect in the Process of Translation -- Multi-task Learning Improves Dialogue Machine Translation

Language agnostic missing subtitle detection

Chinese Dialogue Analysis Using Multi-Task Learning Framework

Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

Building Chinese Sense Annotated Corpus with the Help of Software Tools

Real-Time Automatic Translation Algorithm for Chinese Subtitles in Media Playback Using Knowledge Base

Enhancing Abstractive Dialogue Summarization with Internal Knowledge

Bi-Text Alignment of Movie Subtitles for English-Arabic Statistical Machine Translation

Unsupervised Abstractive Dialogue Summarization for Tete-a-Tetes

Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization