Domain adaptation for statistical machine translation

Mansour, Saab; Ney, Hermann (Thesis advisor); Sima'an, Khalil (Thesis advisor)

Aachen (2017, 2018)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2017


In this thesis we develop and evaluate a general framework for domain adaptation of statistical machine translation (SMT) systems. The framework relies on the availability of in-domain training data and a scoring scheme to differentiate among the other-domain training instances. The adapted models include various models used in the translation process, with more focus on phrase model adaptation, which has been researched less. The language model is used in many applications, e.g. speech recognition and character recognition, and domain adaptation has been researched extensively for this model. Domain adaptation is the task of adapting an existing general-domain system to perform better on a target-domain evaluation set. Weighting the training data has been proposed in the past as a way to perform domain adaptation, and prominent previous work used language model perplexities to perform the scoring. We present a general framework to perform the weighting. Moreover, we develop several novel scoring models that rely on translation model scores to perform the scoring. We hypothesize that for translation model adaptation, translation model scores are more relevant than language model scores, as the former capture bilingual dependencies, which are fundamental to the translation task. The main part of the thesis covers the development of several scoring schemes for adaptation. Novelties include the use of IBM Model 1 perplexities and, more prominently, a method to generate translation model scores that represent relatedness to the target domain. The methods are evaluated consistently on competitive Arabic-to-English and German-to-English translation tasks, and significant improvements in translation quality are reported. A limitation of the framework presented is its reliance on in-domain training data. We therefore also tackle the scenario where no explicit bilingual in-domain training data exists to perform adaptation.
We rely on monolingual source test data to induce the domain and show that the novel use of automatic translations of the source test data can improve over a state-of-the-art SMT system. We expand the notion of domain to include information about dialects. In our setup we tackle the translation of dialectal Egyptian Arabic to English. A dialect classifier is developed within this work, achieving state-of-the-art classification accuracy. The classifier is then used in several techniques to adapt a general SMT system, and improvements are reported. Finally, we present work on Arabic segmentation for machine translation. Arabic is a morphologically rich language, where a single word can be composed of several morphemes that correspond to several words in English. Different segmentation schemes and models are implemented within this work. We show that the performance of the segmentation schemes varies according to the domain and that careful design of a scheme is required to perform best on a given domain. A combination strategy is then presented, and best practices for performing the combination are discussed.
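
As an illustration of scoring training instances with translation model scores, the following minimal sketch ranks sentence pairs by the difference between their IBM Model 1 cross-entropy under an in-domain lexicon and under a general-domain lexicon. It is an assumption-laden sketch, not the method of the thesis: the cross-entropy-difference criterion, the floor probability, the function names and the toy lexica are all hypothetical choices made for the example.

```python
import math

FLOOR = 1e-6  # assumed floor probability for word pairs missing from a lexicon


def model1_cross_entropy(src, tgt, lexicon):
    """Per-word negative log-likelihood of tgt given src under IBM Model 1.

    lexicon maps (tgt_word, src_word) -> t(tgt_word | src_word).
    Both sentences are assumed to be non-empty token lists.
    """
    src_with_null = ["<NULL>"] + src
    total = 0.0
    for f in tgt:
        # Model 1 averages the lexical probability over all source positions.
        p = sum(lexicon.get((f, e), FLOOR) for e in src_with_null)
        p /= len(src_with_null)
        total -= math.log(p)
    return total / len(tgt)


def adaptation_score(src, tgt, in_lex, gen_lex):
    """Cross-entropy difference; lower values mean closer to the in-domain model."""
    return (model1_cross_entropy(src, tgt, in_lex)
            - model1_cross_entropy(src, tgt, gen_lex))


# Toy lexica (hypothetical values, for illustration only).
in_lex = {("patient", "Patient"): 0.9, ("the", "der"): 0.8}
gen_lex = {("patient", "Patient"): 0.1, ("the", "der"): 0.8,
           ("house", "Haus"): 0.9}

pairs = [
    (["das", "Haus"], ["the", "house"]),
    (["der", "Patient"], ["the", "patient"]),
]
# Sort so that the pairs most related to the in-domain data come first.
ranked = sorted(pairs, key=lambda p: adaptation_score(p[0], p[1], in_lex, gen_lex))
```

With such scores, the training data could either be filtered by a threshold or weighted when the phrase model is estimated; which of the two is appropriate depends on the setup.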


  • Chair of Computer Science 6 (Machine Learning and Reasoning) [122010]
  • Department of Computer Science [120000]