Ninth Workshop on Building and Using Comparable Corpora Workshop Programme
Xinhua Zeng,Shouguo Zheng,Xiongwei Sun,Shaoqi Wang,Shizhuang Weng,Reinhard Rapp,S. Sharoff,A. Aker,G. Grefenstette,Silvia Hansen-Schirra,Michael Mohler,E. Morin,Ted Pedersen,Michel Simard,Andrey Kutuzov,Mikhail Kopotev,Tatyana Sviridenko,Zede Zhu,Pierre Zweigenbaum,Ivanova,Lyubov,Kopotev,Mikhail,Kutuzov Andrey,R. Mitkov,Peter Lang,Romanian,Mendoza Rivera,Mitkov R,G. Corpas,Pastor,Technology Nice,France Mitkov,Pekar V,Blagoev D Mulloni,Blagoev D,Mulloni A,Lyubov Ivanova
2016-01-01
Abstract:Comparable corpora are the most versatile and valuable resource for multilingual Natural Language Processing. The speaker will argue that comparable corpora can support a wider range of applications than has been demonstrated so far in the state of the art. The talk will present completed and ongoing work conducted by the speaker and colleagues from his research group where comparable corpora are employed for different tasks including but not limited to the identification of cognates and false friends, validation of translation universals, language change and translation of multiword expressions. Corpora have long been the preferred resource for a number of NLP applications and language users. They offer a reliable alternative to dictionaries and lexicographical resources which may offer only limited coverage. In the case of terminology, for instance, new terms are coined on a daily basis and dictionaries or other lexical resources, however up-to-date they are, cannot keep up with the rate of emergence of new terms. As a result, terminologists (or term extraction programs) seek to analyse the use and/or identify the translation of a specific term using corpora. Ideally, parallel data would be the best resource both for multilingual NLP applications such as Machine Translation systems and for users such as translators, interpreters or language learners. However, parallel corpora or translation memories may not be available: they may be time-consuming to develop, or difficult to acquire because they may be expensive or proprietary. An alternative and more promising approach would be to benefit from comparable corpora, which are easier to compile for a specific purpose or task. Comparable corpora, whether strictly comparable by definition or ‘loosely’ comparable, have already been used in applications such as Machine Translation (Rapp, Sharoff and Zweigenbaum 2016) and term extraction and have been used by translators (Corpas and Seghiri 2009).
The good news is that comparable corpora can facilitate almost any multilingual application and can be beneficial to almost any language user. The view of the speaker is that comparable corpora are the most versatile, valuable and practical resource for multilingual NLP. The invited talk at the BUCC workshop at LREC’2016 will show that comparable corpora can offer more in terms of value and can support a wider range of applications than has been demonstrated so far in the state of the art. The talk will present completed and ongoing work conducted by the speaker and his colleagues at the Research Group in Computational Linguistics at the University of Wolverhampton in the domain of comparable corpora. The talk will start with a discussion of the notion of comparable corpora and issues related to their use and compilation, and will briefly outline work by the speaker and his colleagues on the methodology related to the extraction of comparable documents and the building of purpose-specific comparable corpora. Next, the work carried out by the author on the automatic identification of cognates and false friends using comparable data will be presented. This will be followed by the presentation of three novel approaches developed by the speaker which use comparable data but do not resort to any dictionaries or parallel corpora, together with extensive evaluations of their performance. The speaker will then focus on the use of purpose-built comparable corpora and NLP methodology in a project whose objective was to test the validity of so-called translation universals. In particular, the experiments on validating the universals of simplification, convergence and transfer will be detailed.
Following from this study, the speaker will outline the work on the use of comparable corpora to track language change over time, in particular the recent changes in lexical density and lexical richness in two consecutive thirty-year time periods in British English (1931–1961 and 1961–1991) and in American English from the 1960s to the 1990s (1961–1992). Finally, the speaker will share the latest results from his work with colleagues on the use of comparable corpora for extracting and translating multiword expressions. The methodology developed does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from comparable corpora, inexpensively compiled with the help of the ACCURAT toolkit (Su and Babych 2012a) where only documents above a specific threshold were considered for inclusion. The presentation will conclude with the results of an interesting experiment as part of this study which sought to establish whether large loosely comparable data would yield better results than smaller but strictly comparable corpora.
Bibliographical References
Corpas, G. 2008. Investigar con corpus en traducción: los retos de un nuevo paradigma. Frankfurt: Peter Lang.
Corpas, G. and Seghiri M. 2009. "Virtual Corpora as Documentation Resources: Translating Travel Insurance Documents (English-Spanish)". In Beeby, A., Sánchez, P. and Rodríguez P. (Eds) Corpus Use and Learning to Translate. Proceedings from the CULT Conference, Barcelona, Spain, John Benjamins, 75-107.
Corpas, G., Mitkov R., Afzal, N. and Garcia Moya, L. 2008. "Translation universals: do they exist? A corpus-based and NLP approach to convergence". Proceedings of the LREC’2008 Workshop on Building and Using Comparable Corpora.
Corpas, G., Mitkov R., Afzal, N. and Pekar, V. 2008. "Translation universals: do they exist? A corpus-based NLP study of convergence and simplification". Proceedings of the AMTA’2008 conference, Honolulu, Hawaii, 75-81.
Costa, H., Corpas, G., Mitkov, R. and M. Seghiri. 2015. "Towards a Web-based Tool to Semi-automatically Compile, Manage and Explore Comparable and Parallel Corpora". In Proceedings of the 7th International Conference of the Iberian Association of Translation and Interpreting Studies (AIETI’2015). Malaga, Spain.
Costa, H., Corpas, G. and R. Mitkov. 2015. "Measuring relatedness between documents in comparable corpora". In Proceedings of the 11th International Conference on Terminology and Artificial Intelligence (TIA'15), Granada, Spain, 29-37.
Fung, P. and Cheung, P. 2004. "Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus". In Proceedings of the 20th international conference on Computational Linguistics (COLING), Geneva, Switzerland.
Ilisei, I., Inkpen, D., Corpas, G., and Mitkov, R. 2012. "Romanian Translational Corpora: Building Comparable Corpora for Translation Studies". In Proceedings of the 5th Workshop on Building and Using Comparable Corpora (5th BUCC), held in conjunction with LREC 2012, Istanbul, Turkey, 56-61.
Kilgarriff, A. 2010. "Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project". In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, LREC’2010, Malta.
Mendoza Rivera, O., Mitkov R. and G. Corpas Pastor. 2013. "A Flexible Framework for Collocation Retrieval and Translation from Parallel and Comparable Corpora". In Proceedings of the International Workshop on Multiword units in Machine Translation and Translation Technology. Nice, France.
Mitkov R., Pekar V., Blagoev D. and Mulloni A. 2008. "Methods for extracting and classifying pairs of cognates and false friends". Machine Translation, 21 (1), 29-53.
Mitkov, R. 2016. "The benefit of comparable corpora: automatic translation of multiword expressions without translation resources" (forthcoming). In Corpas, G. and Seghiri, M. (Eds). Corpus-based approaches to translation and interpreting: from theory to applications. Peter Lang.
Pekar V., Mitkov R., Blagoev D. and Mulloni A. 2008. "Finding Translations for Low-Frequency Words in Comparable Corpora". Machine Translation, 20 (4), 247-266.
Pekar V., Mitkov R., Blagoev D. and Mulloni A. 2007. "Finding Translations for Low-Frequency Words in Comparable Corpora". Proceedings of the CONTEXT07 Workshop on "Contextual Information in Semantic Space Models" (CoSmo-2007), Roskilde, Denmark, 17-25.
Pinnis, M., Ion, R., Ştefănescu, D., Su, F., Skadiņa, I., Vasiļjevs, A., and Babych, B. 2012. "ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora". Proceedings of the ACL 2012 System Demonstrations, Jeju, Korea, 91-96.
Rapp, R, Sharoff, S. and Zweigenbaum, P. (Eds). 2016. Special Issue on using comparable corpora for Machine Translation. Journal of Natural Language Engineering, 22(4). (forthcoming).
Skadiņa, I., Aker, A., Mastropavlos, N., Su, F., Tufiș, D., Verlic, M., Vasiļjevs, A., Babych, B., Clough, P., Gaizauskas, R., Glaros, N., Paramita, M. and Pinnis, M. 2012. "Collecting and Using Comparable Corpora for Statistical Machine Translation". Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 438–445.
Štajner, S and Mitkov, R. 2012. "Using Comparable Corpora to Track Diachronic and Synchronic Changes in Lexical Density and Lexical Richness". In Proceedings of the 5th Workshop on Building and Using Comparable Corpora (5th BUCC), held in conjunction with LREC 2012, Istanbul, Turkey, 88-97.
Stambolieva, E. 2012. Compiling Comparable Corpora: A Machine Learning Approach. MSc Dissertation, University of Wolverhampton.
Su, F. and Babych, B. 2012a. "Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents". Proceedings of the EACL'12 Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), Avignon, France, 10-19.
Su, F. and Babych, B. 2012b. "Development and Application of a Cross-language Document Comparability Metric". Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 3956-3962.
Taslimipoor, S., Mitkov, R., Corpas Pastor, G. and Fazly, A.
-
Multi-domain machine translation enhancements by parallel data extraction from comparable corpora
Krzysztof Wołk,Emilia Rejmund,Krzysztof Marasek
DOI: https://doi.org/10.48550/arXiv.1603.06785
2016-03-22
Abstract:Parallel texts are a relatively rare language resource; however, they constitute very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from previously built comparable corpora. The methodologies are automatic and unsupervised, which makes them well suited to large-scale research. The task is highly practical, as non-parallel multilingual data occur much more frequently than parallel corpora and accessing them is easy, although parallel sentences are a considerably more useful resource. In this study, we propose a method of automatic web crawling in order to build topic-aligned comparable corpora, e.g. based on Wikipedia or Euronews.com. We also developed new methods of obtaining parallel sentences from comparable data and proposed methods of filtration of corpora capable of selecting inconsistent or only partially equivalent translations. Our methods are easily scalable to other languages. Evaluation of the quality of the created corpora was performed by analysing the impact of their use on statistical machine translation systems. Experiments were presented on the basis of the Polish-English language pair for texts from different domains, i.e. lectures, phrasebooks, film dialogues, European Parliament proceedings and texts contained in medicine leaflets. We also tested a second method of creating parallel corpora based on data from comparable corpora, which allows for automatically expanding the existing corpus of sentences about a given domain on the basis of analogies found between them. It therefore does not require existing parallel resources in order to train a classifier.
Computation and Language,Machine Learning
-
Parallel Corpus Research and Target Language Representativeness: The Contrastive, Typological, and Translation Mining Traditions
Bert Le Bruyn,Martín Fuchs,Martijn van der Klis,Jianan Liu,Chou Mo,Jos Tellings,Henriëtte De Swart
DOI: https://doi.org/10.3390/languages7030176
2022-07-08
Languages
Abstract:This paper surveys the strategies that the Contrastive, Typological, and Translation Mining parallel corpus traditions rely on to deal with the issue of target language representativeness of translations. On the basis of a comparison of the corpus architectures and research designs of the three traditions, we argue that they have each developed their own representativeness strategies: (i) monolingual control corpora (Contrastive tradition), (ii) limits on the scope of research questions (Typological tradition), and (iii) parallel control corpora (Translation Mining tradition). We introduce normalized pointwise mutual information (NPMI) as a bi-directional measure of cross-linguistic association, allowing for an easy comparison of the outcomes of different traditions and the impact of the monolingual and parallel control corpus representativeness strategies. We further argue that corpus size has a major impact on the reliability of the monolingual control corpus strategy and that a sequential parallel control corpus strategy is preferable for smaller corpora.
English Else
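The normalized pointwise mutual information (NPMI) measure named in the abstract above can be illustrated with a minimal Python sketch. It uses the standard definition, NPMI(x, y) = PMI(x, y) / -log p(x, y), estimated from co-occurrence counts. The function name, the count-based interface, and the handling of the two edge cases are assumptions made for illustration; the paper's own estimation procedure is not described in this listing.

```python
import math


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalized pointwise mutual information, in the range [-1, 1].

    count_xy: aligned pairs in which form x and form y co-occur
    count_x, count_y: pairs containing x (resp. y)
    total: all aligned pairs
    (The interface is hypothetical; only the formula is standard.)
    """
    if count_xy == 0:
        return -1.0  # conventional limit: the two forms never co-occur
    p_xy = count_xy / total
    if p_xy == 1.0:
        return 1.0  # assumed convention: avoids dividing by -log(1) = 0
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# x and y always appear together: perfect association
print(npmi(10, 10, 10, 100))
# x and y co-occur exactly as often as chance predicts: no association
print(npmi(25, 50, 50, 100))
```

Because the formula is symmetric in x and y, the same score is obtained whichever language is treated as the source, which is consistent with the abstract's description of NPMI as a bi-directional measure.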
-
Bilingual Terminology Extraction from Comparable E-Commerce Corpora
Hao Jia,Shuqin Gu,Yuqi Zhang,Xiangyu Duan
DOI: https://doi.org/10.48550/arXiv.2104.07398
2022-07-29
Abstract:Bilingual terminologies are important machine translation resources in the field of e-commerce, and they are usually either manually translated or automatically extracted from parallel data. Human translation is costly, and e-commerce parallel corpora are very scarce. However, comparable data in different languages within the same commodity field is abundant. In this paper, we propose a novel framework for extracting e-commerce bilingual terminologies from comparable data. Benefiting from cross-lingual pre-training in e-commerce, our framework can make full use of the deep semantic relationship between a source-side terminology and a target-side sentence to extract the corresponding target terminology. Experimental results on various language pairs show that our approaches achieve significantly better performance than various strong baselines.
Computation and Language
-
Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs
Krzysztof Wołk,Krzysztof Marasek
DOI: https://doi.org/10.1016/j.protcy.2014.11.024
2015-09-30
Abstract:Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences that are filtered out from noisy or just comparable sentence pairs. We describe our implementation of a specialized tool for this task as well as training and adaption of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs.
Computation and Language,Information Retrieval,Machine Learning
-
Termhood-based Comparability Metrics of Comparable Corpus in Special Domain
Sa Liu,Chengzhi Zhang
DOI: https://doi.org/10.1007/978-3-642-36337-5_15
2013-02-19
Abstract:Cross-Language Information Retrieval (CLIR) and machine translation (MT) resources, such as dictionaries and parallel corpora, are scarce and hard to come by for special domains. Moreover, these resources are limited to a few languages, such as English, French, and Spanish. Obtaining comparable corpora automatically for such domains could therefore be an effective answer to this problem. Comparable corpora, whose subcorpora are not translations of each other, can be easily obtained from the web. Building and using comparable corpora is therefore often a more feasible option in multilingual information processing. Comparability metrics are one of the key issues in building and using comparable corpora. Currently, there is no widely accepted definition of corpus comparability or method of measuring it. In fact, different definitions or metrics of comparability might be given to suit various natural language processing tasks. A new comparability metric, namely a termhood-based metric oriented to the task of bilingual terminology extraction, is proposed in this paper. In this method, words are ranked by termhood rather than frequency, and the cosine similarity calculated from the ranked lists of word termhood is used as the comparability score. Experimental results show that the termhood-based metric performs better than traditional frequency-based metrics.
Computation and Language
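The idea in the abstract above (rank words by termhood, then take a cosine similarity over the ranked lists) can be sketched roughly as follows. This is only an illustration under several assumptions: termhood scores are taken as given, since the abstract does not say how they are computed; both subcorpora are assumed to be mapped into a shared vocabulary already, for example through a bilingual dictionary; and the cut-off `k` and all function names are hypothetical.

```python
import math


def top_k(scores: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k entries with the highest termhood score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def comparability(scores_a: dict[str, float],
                  scores_b: dict[str, float],
                  k: int = 100) -> float:
    """Cosine similarity of the two top-k termhood lists (assumed design)."""
    return cosine(top_k(scores_a, k), top_k(scores_b, k))


# Toy termhood scores for two subcorpora in a shared vocabulary (made up).
corpus_a = {"stent": 0.9, "artery": 0.8, "patient": 0.3}
corpus_b = {"stent": 0.7, "artery": 0.9, "invoice": 0.2}
print(comparability(corpus_a, corpus_b, k=3))
```

A frequency-based baseline of the kind the abstract compares against would use the same code with raw frequencies in place of termhood scores.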
-
A survey of neural-network-based methods utilising comparable data for finding translation equivalents
Michaela Denisová,Pavel Rychlý
2024-10-20
Abstract:The importance of inducing bilingual dictionary components in many natural language processing (NLP) applications is indisputable. However, the dictionary compilation process requires extensive work and combines two disciplines, NLP and lexicography, while the former often omits the latter. In this paper, we present the most common approaches from NLP that endeavour to automatically induce one of the essential dictionary components, translation equivalents and focus on the neural-network-based methods using comparable data. We analyse them from a lexicographic perspective since their viewpoints are crucial for improving the described methods. Moreover, we identify the methods that integrate these viewpoints and can be further exploited in various applications that require them. This survey encourages a connection between the NLP and lexicography fields as the NLP field can benefit from lexicographic insights, and it serves as a helping and inspiring material for further research in the context of neural-network-based methods utilising comparable data.
Computation and Language
-
Generating Virtual Parallel Corpus - A Compatibility Centric Method.
Jia Xu,Weiwei Sun
2011-01-01
Abstract:The processing of many natural languages suffers from scarce linguistic resources. We introduce the idea of compatibility to extend training data for machine translation: if translation hypotheses produced by multiple systems are measured as compatible, they are considered reliable predictions. In this way, we generate virtual parallel data per bridge language, and re-compiling on this corpus improves our machine translation quality by more than 30% relative.
-
Corpus Similarity Measures Remain Robust Across Diverse Languages
Haipeng Li,Jonathan Dunn
DOI: https://doi.org/10.48550/arXiv.2206.04332
2022-06-09
Abstract:This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task. The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora. Both of these goals are essential for measuring how well corpus-based linguistic analysis generalizes from one dataset to another. The problem is that previous work has focused on Indo-European languages, raising the question of whether these measures are able to provide robust generalizations across diverse languages. This paper uses a register prediction task to evaluate competing measures across 39 languages: how well are they able to distinguish between corpora representing different contexts of production? Each experiment compares three corpora from a single language, with the same three digital registers shared across all languages: social media, web pages, and Wikipedia. Results show that measures of corpus similarity retain their validity across different language families, writing systems, and types of morphology. Further, the measures remain robust when evaluated on out-of-domain corpora, when applied to low-resource languages, and when applied to different sets of registers. These findings are significant given our need to make generalizations across the rapidly increasing number of corpora available for analysis.
Computation and Language
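The abstract above refers to frequency-based corpus similarity measures without naming a specific one. As one common example of such a measure, the sketch below computes a Spearman rank correlation over the frequencies of the most common words in two corpora. The choice of Spearman, the feature cut-off `n_features`, and all function names are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from highest (rank 1) downward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman_similarity(tokens_a: list[str], tokens_b: list[str],
                        n_features: int = 500) -> float:
    """Spearman correlation of word-frequency ranks over the top shared features."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    features = [w for w, _ in (freq_a + freq_b).most_common(n_features)]
    ranks_a = average_ranks([freq_a[w] for w in features])
    ranks_b = average_ranks([freq_b[w] for w in features])
    return pearson(ranks_a, ranks_b)


sample = "the cat sat on the mat the cat".split()
print(spearman_similarity(sample, sample))
```

Higher values indicate that the two corpora rank their frequent words in a similar order; a register prediction task like the one described above would compare such scores within and across registers.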
-
A Parallel Corpus of Translationese
Ella Rabinovich,Shuly Wintner,Ofek Luis Lewinsohn
DOI: https://doi.org/10.48550/arXiv.1509.03611
2016-03-06
Abstract:We describe a set of bilingual English-French and English-German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research on translationese and its applications to (human and machine) translation; specifically, they can be used for the task of translationese identification, a research direction that has enjoyed growing interest in recent years. To validate the quality and reliability of the corpora, we replicated previous results of supervised and unsupervised identification of translationese, and further extended the experiments to additional datasets and languages.
Computation and Language
-
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models
Peiqin Lin,André F. T. Martins,Hinrich Schütze
2024-06-29
Abstract:Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.
Computation and Language
-
The case of InterCorp, a multilingual parallel corpus
František Čermák,Alexandr Rosen
DOI: https://doi.org/10.1075/ijcl.17.3.05cer
2012-12-31
International Journal of Corpus Linguistics
Abstract:This paper introduces InterCorp, a parallel corpus including texts in Czech and 27 other languages, available for online searches via a web interface. After discussing some issues and merits of a multilingual resource we argue that it has an important role especially for languages with fewer native speakers, supporting both comparative research and studies of the language from the perspective of other languages. We proceed with an overview of the corpus — the strategy and criteria for including new texts, the representation of available languages and text types, linguistic annotation, and a sketch of pre-processing issues. Finally, we present the search interface and suggest some research opportunities.
linguistics
-
Development of Translation Database based on Chinese-English parallel corpora
He Lianzhen
DOI: https://doi.org/10.3969/j.issn.1003-6105.2007.02.009
2007-01-01
Abstract:This paper reports on a Sino-British joint project that aims to create a Chinese-English Translation Database listing English translation units together with their Chinese equivalents and vice versa. For this purpose, the bilingual texts were first aligned at sentence level and then the Chinese and English texts were annotated respectively. From the aligned texts both Chinese and English multi-word units were identified separately from each corpus. After this, the computer software sought to establish the correspondence between Chinese and English translation units, thus creating a list of bilingual Translation Equivalent Pairs, which were then manually validated and input to the Translation Database. The above approach is content-oriented and characterized by unambiguous words or multi-word units as basic translation units and a collection of bilingual translation units, i.e. translation units in one language and their translation equivalents in the target language. This approach has been shown to help improve the efficiency and precision of machine translation.
-
Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec
Atnafu Lambebo Tonja,Christian Maldonado-Sifuentes,David Alejandro Mendoza Castillo,Olga Kolesnikova,Noé Castro-Sánchez,Grigori Sidorov,Alexander Gelbukh
DOI: https://doi.org/10.48550/arXiv.2305.17404
2023-05-27
Abstract:In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The findings show that the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) affects translation performance and that indigenous languages work better when used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings. The dataset and scripts are available at https://github.com/atnafuatx/Machine-Translation-Resources
Computation and Language
-
Neural machine translation, corpus and frugality
Raoul Blin
DOI: https://doi.org/10.48550/arXiv.2101.10650
2021-01-26
Computation and Language
Abstract:In the machine translation field, in both academia and industry, there is a growing interest in increasingly powerful systems, using corpora of several hundred million to several billion examples. These systems represent the state of the art. Here we defend the idea of developing, in parallel, "frugal" bilingual translation systems trained with relatively small corpora. Based on the observation of a standard human professional translator, we estimate that the corpora should comprise at most a monolingual sub-corpus of 75 million examples for the source language, a second monolingual sub-corpus of 6 million examples for the target language, and an aligned bilingual sub-corpus of 6 million bi-examples. A less desirable alternative would be an aligned bilingual corpus of 47.5 million bi-examples.
-
No Longer Lost in Translation: Evidence that Google Translate Works for Comparative Bag-of-Words Text Applications
Erik de Vries,Martijn Schoonvelde,Gijs Schumacher
DOI: https://doi.org/10.1017/pan.2018.26
2018-09-11
Political Analysis
Abstract:Automated text analysis allows researchers to analyze large quantities of text. Yet, comparative researchers are presented with a big challenge: across countries people speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models—such as topic models. We use the europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find TDMs for both text corpora to be highly similar, with minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar with again only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.
political science
-
Extracting an English-Persian Parallel Corpus from Comparable Corpora
Akbar Karimi,Ebrahim Ansari,Bahram Sadeghi Bigham
DOI: https://doi.org/10.48550/arXiv.1711.00681
2019-04-01
Abstract:Parallel data are an important part of a reliable Statistical Machine Translation (SMT) system. The more of these data are available, the better the quality of the SMT system. However, for some language pairs such as Persian-English, parallel sources of this kind are scarce. In this paper, a bidirectional method is proposed to extract parallel sentences from document-aligned English and Persian Wikipedia. Two machine translation systems are employed to translate from Persian to English and the reverse, after which an IR system is used to measure the similarity of the translated sentences. Adding the extracted sentences to the training data of the existing SMT systems is shown to improve the quality of the translation. Furthermore, the proposed method slightly outperforms the one-directional approach. The extracted corpus consists of about 200,000 sentences, which have been sorted by their degree of similarity as calculated by the IR system, and is freely available for public access on the Web.
Computation and Language,Information Retrieval
-
Principled Paraphrase Generation with Parallel Corpora
Aitor Ormazabal,Mikel Artetxe,Aitor Soroa,Gorka Labaka,Eneko Agirre
DOI: https://doi.org/10.48550/arXiv.2205.12213
2023-05-23
Abstract:Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metric that mitigates this issue by requiring the entire translation distribution to match, and implement a relaxation of it through the Information Bottleneck method. Our approach incorporates an adversarial term into MT training in order to learn representations that encode as much information about the reference translation as possible, while keeping as little information about the input as possible. Paraphrases can be generated by decoding back to the source from this representation, without having to generate pivot translations. In addition to being more principled and efficient than round-trip MT, our approach offers an adjustable parameter to control the fidelity-diversity trade-off, and obtains better results in our experiments.
Computation and Language
-
Automatic construction of English/Chinese parallel corpora
Christopher C. Yang,Kar Wing Li
DOI: https://doi.org/10.1002/asi.10261
2003-01-01
Journal of the American Society for Information Science and Technology
Abstract:As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for applications in cross‐lingual information retrieval and natural language processing. In order to cross the boundaries that exist between different languages, dictionaries are the most typical tools. However, the general‐purpose dictionary is less sensitive in both genre and domain. It is also impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus‐based approaches, which do not have the limitation of dictionaries, provide a statistical translation model with which to cross the language boundary. There are many domain‐specific parallel or comparable corpora that are employed in machine translation and cross‐lingual information retrieval. Most of these are corpora between Indo‐European languages, such as English/French and English/Spanish. Asian/Indo‐European corpora, especially English/Chinese corpora, are relatively sparse. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. In this paper, an alignment method is presented which is based on dynamic programming to identify the one‐to‐one Chinese and English title pairs. The method includes alignment at title level, word level and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. As one word in one language may translate into two or more words repetitively in another language, the edit operation, deletion, is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments have been conducted to investigate the performance of the proposed method using the daily press release articles by the Hong Kong SAR government as the test bed. The precision of the result is 0.998 while the recall is 0.806.
The release articles and speech articles published by Hongkong & Shanghai Banking Corporation Limited are also used to test the method; the precision is 1.00 and the recall is 0.948.
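The longest common subsequence step named in this abstract can be illustrated with textbook dynamic programming. The following is a minimal sketch and not the authors' implementation: the function names `lcs_length` and `lcs_ratio`, and the normalisation by the longer sequence, are assumptions introduced here for illustration only.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b.

    Works on any two sequences (strings of characters, lists of words).
    Only two rows of the dynamic-programming table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the subsequence that ended before both items.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one item from either sequence, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a, b):
    """Hypothetical normalised score: LCS length over the longer length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0
```

A score of this kind could serve as one ingredient when ranking candidate title pairs, although the score function actually used in the paper is not specified in the abstract.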
-
Using Document Similarity Methods to create Parallel Datasets for Code Translation
Mayank Agarwal,Kartik Talamadupula,Fernando Martinez,Stephanie Houde,Michael Muller,John Richards,Steven I Ross,Justin D. Weisz
DOI: https://doi.org/10.48550/arXiv.2110.05423
2021-10-11
Computation and Language
Abstract:Translating source code from one programming language to another is a critical, time-consuming task in modernizing legacy applications and codebases. Recent work in this space has drawn inspiration from the software naturalness hypothesis by applying natural language processing techniques towards automating the code translation task. However, due to the paucity of parallel data in this domain, supervised techniques have only been applied to a limited set of popular programming languages. To bypass this limitation, unsupervised neural machine translation techniques have been proposed to learn code translation using only monolingual corpora. In this work, we propose to use document similarity methods to create noisy parallel datasets of code, thus enabling supervised techniques to be applied for automated code translation without having to rely on the availability or expensive curation of parallel code datasets. We explore the noise tolerance of models trained on such automatically-created datasets and show that these models perform comparably to models trained on ground truth for reasonable levels of noise. Finally, we exhibit the practical utility of the proposed method by creating parallel datasets for languages beyond the ones explored in prior work, thus expanding the set of programming languages for automated code translation.
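The idea of pairing documents by similarity to obtain a noisy parallel dataset can be sketched as follows. This is an illustration under stated assumptions, not the method evaluated in the paper: the bag-of-tokens cosine similarity, the crude `tokenize` regex, the `pair_documents` helper and its `threshold` parameter are all hypothetical choices made here.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split source code into identifier and number tokens (a crude assumption)."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code)


def cosine(c1, c2):
    """Cosine similarity between two token-count vectors."""
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def pair_documents(src_docs, tgt_docs, threshold=0.5):
    """Pair each source document with its most similar target document.

    Returns (source_index, target_index, score) tuples for pairs whose
    score reaches the threshold. The resulting pairs are noisy by design.
    """
    tgt_vecs = [Counter(tokenize(d)) for d in tgt_docs]
    pairs = []
    for i, doc in enumerate(src_docs):
        vec = Counter(tokenize(doc))
        scores = [cosine(vec, t) for t in tgt_vecs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs
```

Lowering the hypothetical `threshold` yields more pairs at the cost of more noise, which is the trade-off the abstract's noise-tolerance experiments are concerned with.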
-
Exploiting multilingual nomenclatures and language-independent text features as an interlingua for cross-lingual text analysis applications
Ralf Steinberger,Bruno Pouliquen,Camelia Ignat
DOI: https://doi.org/10.48550/arXiv.cs/0609064
2006-09-12
Computation and Language
Abstract:We propose a simple but efficient basic approach for a number of multilingual and cross-lingual language technology applications that are not limited to the usual two or three languages, but that can be applied with relatively little effort to larger sets of languages. The approach consists of using existing multilingual linguistic resources such as thesauri, nomenclatures and gazetteers, as well as exploiting the existence of additional more or less language-independent text items such as dates, currency expressions, numbers, names and cognates. Mapping texts onto the multilingual resources and identifying word token links between texts in different languages are basic ingredients for applications such as cross-lingual document similarity calculation, multilingual clustering and categorisation, cross-lingual document retrieval, and tools to provide cross-lingual information access.
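As a rough illustration of how language-independent text items can link texts across languages, the sketch below compares two texts by the numeric tokens they share. It covers only numbers, not the names, cognates or multilingual nomenclatures the abstract also mentions, and the function names and the Jaccard overlap measure are assumptions made here rather than the authors' approach.

```python
import re


def language_independent_items(text):
    """Collect numeric tokens (years, amounts, counts) as a set of strings."""
    return set(re.findall(r"\d+(?:[.,]\d+)*", text))


def item_overlap(text_a, text_b):
    """Jaccard overlap of the numeric tokens found in two texts."""
    a = language_independent_items(text_a)
    b = language_independent_items(text_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Two texts in different languages that report the same figures would receive a high overlap under this measure, with no translation step involved.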