Abstract: Despite its significant results, Neural Machine Translation still faces a major problem: parallel corpora are scarce or absent for many languages. This article proposes a method for generating a considerable amount of parallel corpus data for any language pair, extracted from open-source material available on the Internet. The corpus contents are derived from video subtitles. The method takes as input a set of video titles with attributes such as release date, rating, and duration. A crawler automates the process of finding and downloading subtitle pairs for the desired language pair. Finally, sentence pairs are extracted from the synchronous dialogues in the subtitles. The main problem with this method is unsynchronized subtitle pairs, so subtitles are verified before downloading. If two subtitles are not synchronized, another subtitle of the same video is processed, and this repeats until a matching subtitle is found. Filtering videos by genre makes it possible to build context-based parallel corpora. A context-based corpus can be used in complex translators that first determine the subject of the content and then decode sentences with a network specific to that subject. Languages differ considerably between their formal and informal styles, in both vocabulary and syntax. Another advantage of this method is that it yields a corpus in the informal style of a language: most movie dialogue is conversational and therefore informal. This property of the generated corpus can be used in real-time translators to produce more accurate translations of conversations.
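
The extraction and verification steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how synchronization is checked, so everything here is an assumption made for illustration: the `Cue` type, the functions `overlap_ratio`, `align_cues`, and `is_synchronized`, and the thresholds 0.5 and 0.8 are hypothetical. The sketch pairs subtitle cues whose time spans overlap and treats a subtitle pair as synchronized when most source cues find a match.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cue:
    """One subtitle entry: start and end time in seconds, plus its text."""
    start: float
    end: float
    text: str


def overlap_ratio(a: Cue, b: Cue) -> float:
    """Overlapping time divided by the duration of the shorter cue."""
    inter = min(a.end, b.end) - max(a.start, b.start)
    shorter = min(a.end - a.start, b.end - b.start)
    if inter <= 0 or shorter <= 0:
        return 0.0
    return inter / shorter


def align_cues(src: List[Cue], tgt: List[Cue],
               min_overlap: float = 0.5) -> List[Tuple[str, str]]:
    """Pair cues from two time-sorted subtitle tracks by time overlap.

    Each target cue is used at most once (assumed one-to-one pairing).
    """
    pairs = []
    j = 0
    for s in src:
        # Skip target cues that end before this source cue starts.
        while j < len(tgt) and tgt[j].end <= s.start:
            j += 1
        if j < len(tgt) and overlap_ratio(s, tgt[j]) >= min_overlap:
            pairs.append((s.text, tgt[j].text))
            j += 1
    return pairs


def is_synchronized(src: List[Cue], tgt: List[Cue],
                    min_matched: float = 0.8) -> bool:
    """Assumed criterion: enough source cues have an overlapping target cue."""
    if not src:
        return False
    return len(align_cues(src, tgt)) / len(src) >= min_matched


english = [Cue(0.0, 2.0, "Hello"), Cue(3.0, 5.0, "How are you?")]
french = [Cue(0.1, 2.1, "Bonjour"), Cue(3.2, 5.0, "Comment vas-tu ?")]
shifted = [Cue(c.start + 10, c.end + 10, c.text) for c in french]

print(align_cues(english, french))
print(is_synchronized(english, french), is_synchronized(english, shifted))
```

In this sketch, a subtitle file that fails `is_synchronized` would be discarded and the next candidate subtitle for the same video tried, mirroring the retry step the abstract describes. A real pipeline would also need to parse subtitle files and handle one-to-many cue matches, which are left out here.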

Pansori: ASR Corpus Generation from Open Online Video Contents

Kosp2e: Korean Speech to English Translation Corpus

KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition

Creating Speech-to-Speech Corpus from Dubbed Series

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video

Construction of a Large-scale Japanese ASR Corpus on TV Recordings

CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese

Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora

Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview

Generating Multilingual Parallel Corpus Using Subtitles

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

YODAS: Youtube-Oriented Dataset for Audio and Speech

KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services

Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

Building Korean Sign Language Augmentation (KoSLA) Corpus with Data Augmentation Technique

Speech Corpus for Korean Children with Autism Spectrum Disorder: Towards Automatic Assessment Systems

An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification

A Novel Task-Oriented Text Corpus in Silent Speech Recognition and its Natural Language Generation Construction Method

Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis