Effective training and efficient decoding for statistical machine translation

  • Effektives Training und effizientes Decodieren für statistische maschinelle Übersetzung

Wübker, Jörn; Ney, Hermann (Thesis advisor); van Genabith, Josef (Thesis advisor)

Aachen (2017)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2017


Statistical machine translation, the task of translating text from one natural language into another using statistical models, can be divided into three main problems: modeling, search, and training. This thesis gives a detailed description of the most popular approach to statistical machine translation, the phrase-based paradigm, and presents several improvements to the state of the art in all three of these aspects.

Regarding the search problem, we propose three novel language model look-ahead techniques that considerably increase the time efficiency of the search algorithm, each with a different quality trade-off. They are evaluated in detail with respect to their effect on translation quality, translation speed, the number of language model queries, and the number of nodes generated in the search graph. We show that our final system outperforms the popular Moses toolkit in terms of translation speed.

With regard to the modeling problem, we extend the state of the art with novel smoothing models based on word classes. Data sparsity is a common pitfall for statistical models. We leverage word classes, which can be learned in an unsupervised fashion, to re-parameterize the standard phrase-based models, resulting in a smoother probability distribution and reduced sparsity.

The largest part of this work is dedicated to the training problem. We investigate both generative and discriminative training methods, two fundamentally different approaches to learning statistical models. Our generative procedure is inspired by the expectation-maximization algorithm and is based on force-aligning the training data, applying the leave-one-out technique to avoid overfitting. Its advantage over the standard heuristic model extraction is that it provides a framework which uses the same consistent models in training and in search.
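To illustrate the word-class smoothing idea, the following minimal Python sketch interpolates a sparse word-level relative-frequency translation model with a smoother class-level one. This is a toy illustration of the general principle under assumed names (`class_smoothed_prob`, `word2class`), not the thesis's actual parameterization.

```python
from collections import Counter

def class_smoothed_prob(pairs, word2class, f, e, lam=0.7):
    """Hypothetical sketch of word-class smoothing: interpolate a sparse
    word-level relative-frequency model p(f|e) with a class-level model
    p(C(f)|C(e)) learned from the same (source, target) training pairs."""
    # Word-level relative frequency p(f | e).
    pair_c = Counter(pairs)
    tgt_c = Counter(e_ for _, e_ in pairs)
    p_word = pair_c[(f, e)] / tgt_c[e] if tgt_c[e] else 0.0

    # Class-level relative frequency p(C(f) | C(e)); unseen words fall
    # back to a shared unknown class.
    cls = lambda w: word2class.get(w, "<unk>")
    cpair_c = Counter((cls(f_), cls(e_)) for f_, e_ in pairs)
    ctgt_c = Counter(cls(e_) for _, e_ in pairs)
    cf, ce = cls(f), cls(e)
    p_cls = cpair_c[(cf, ce)] / ctgt_c[ce] if ctgt_c[ce] else 0.0

    # Linear interpolation: the sparse word model is smoothed toward the
    # less sparse class model, reducing the impact of low counts.
    return lam * p_word + (1.0 - lam) * p_cls
```

Because class inventories are far smaller than vocabularies, the class-level counts are much denser, which is what yields the smoother distribution.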
The initial technique is further developed into a length-incremental procedure that does not require initialization with a Viterbi word alignment and is thus not biased by its inconsistencies. Both the learning procedure and the resulting models are analyzed in detail.

As a discriminative training procedure, we employ a gradient-based method to optimize an expected BLEU objective function. Our novel contribution is the application of the resilient backpropagation (Rprop) algorithm, which is experimentally shown to be superior to several previously proposed techniques. It is also significantly more time- and memory-efficient than previous work, allowing us to run training on the largest data set reported in the literature to date.

Our novel techniques are experimentally evaluated against internal and external results on large-scale translation tasks and within public evaluation campaigns. The word-class language model and the discriminative training procedure in particular prove valuable for state-of-the-art large-scale translation systems.
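The resilient backpropagation update can be sketched as follows. This is a generic Rprop variant (iRprop- style: per-parameter step sizes adapted from gradient signs only, with the update suppressed on a sign flip) applied to a toy objective; it is not the thesis's expected-BLEU implementation, and all names and hyperparameter values are illustrative defaults.

```python
def rprop_minus(grad_fn, w0, steps=100, eta_plus=1.2, eta_minus=0.5,
                delta0=0.1, delta_min=1e-6, delta_max=50.0):
    """Minimal Rprop sketch: each parameter keeps its own step size,
    grown when successive gradients agree in sign and shrunk when they
    disagree; only the sign of the gradient determines the direction."""
    w = list(w0)
    delta = [delta0] * len(w)       # per-parameter step sizes
    prev_g = [0.0] * len(w)
    for _ in range(steps):
        g = list(grad_fn(w))
        for i in range(len(w)):
            s = g[i] * prev_g[i]
            if s > 0:               # same sign twice: accelerate
                delta[i] = min(delta[i] * eta_plus, delta_max)
            elif s < 0:             # sign flip: overshoot, slow down
                delta[i] = max(delta[i] * eta_minus, delta_min)
                g[i] = 0.0          # suppress this update (iRprop- style)
            if g[i] > 0:
                w[i] -= delta[i]    # move against the gradient sign
            elif g[i] < 0:
                w[i] += delta[i]
            prev_g[i] = g[i]
    return w
```

Ignoring gradient magnitudes makes the method robust to badly scaled objectives, which is one reason a per-parameter sign-based scheme is attractive for noisy objectives such as expected BLEU.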