Neural machine translation for low-resource scenarios

  • Neuronale maschinelle Übersetzung für ressourcenarme Szenarien

Kim, Yunsu; Ney, Hermann (Thesis advisor); Juan-Císcar, Alfons (Thesis advisor)

Aachen : RWTH Aachen University (2022)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2022


Machine translation has been tackled for decades mainly by statistical learning on bilingual text data. In the most recent paradigm with neural network modeling, building a machine translation system requires more data than ever to make the best use of the state-of-the-art modeling capacity and yield reasonable performance. However, many language pairs and domains lack sufficient bilingual corpora. To expand the coverage of neural machine translation, this thesis investigates effective methods to improve performance in such low-resource scenarios.

Firstly, we study the usage of monolingual corpora for neural machine translation. To begin with, we optimize the log-linear integration of a language model into translation decoding. Next, we review various synthetic data generation strategies and compare their empirical performance at scale. In addition, we investigate pre-training and multi-task training of a translation model with language modeling and Cloze task objectives. We compare all these methods empirically to provide best practices for semi-supervised learning to compensate for the performance loss in a low-resource case.

Secondly, we examine cross-lingual transfer from a high-resource setting to a low-resource setting. This study covers two pragmatic scenarios: transfer between language pairs that share a common target side, and transfer from multiple language pairs based on a pivot language, e.g. for a non-English language pair with English as the pivot. For both scenarios, we develop a series of sequential transfer techniques to maximize the effectiveness of the transfer. The techniques are thoroughly compared to semi-supervised baselines, multilingual models and cascaded architectures.

Lastly, we investigate unsupervised learning for neural machine translation, where only monolingual corpora are used to train a translation model. We cover methods from classical decipherment to sequence-to-sequence training, giving a historical overview of unsupervised translation. For decipherment, we extend its primitive framework to large vocabulary translation by reducing lexicon sizes in training and employing neural network lexicons. Furthermore, we integrate a cross-lingual word embedding lexicon model and apply a neural denoising autoencoder as a postprocessing step, leading to a novel cascaded combination. Finally, we analyze the most sophisticated method with a sequence-to-sequence model, including extensive experimental results on numerous data settings to find out under which conditions unsupervised learning is useful in practice.


  • Department of Computer Science [120000]
  • Chair of Computer Science 6 (Machine Learning and Reasoning) [122010]