Computer Science Advanced Seminar (Informatik-Oberseminar): Domain Adaptation for Statistical Machine Translation

Friday, 19 May 2017, 10:00 a.m.

Location: Informatikzentrum, Building E3, Room 9222, Ahornstraße 55, 52056 Aachen

Speaker: Saab Mansour (M.Sc. Computer Science)


Statistical machine translation (SMT) relies on large amounts of training data to train models that can generate high-quality translations. Recently, approaches that rely on more data to improve the models have reached a plateau in quality. According to the head of Google Translate, for the most common language pairings, "we have reached about the limit where more data is helpful." One way to overcome this limitation is to create SMT systems that are tailored to a specific domain. Applying a single translation engine universally to all domains can hurt translation quality and increase ambiguity. In this work we develop and evaluate a general framework for domain adaptation of SMT systems. The framework relies on the availability of in-domain training data and on a scoring scheme to differentiate among the other-domain training instances. Prominent previous work used language model perplexities to perform the scoring. The main part of the thesis develops several novel scoring models that instead rely on translation model scores. We hypothesize that, for translation model adaptation, translation model scores are more relevant than language model scores, because the former capture bilingual dependencies that are fundamental to the translation task. The methods are evaluated consistently on competitive large-scale translation tasks, and significant improvements in translation quality are reported over state-of-the-art systems.

Hosts: The Computer Science faculty