Computer Science Advanced Seminar (Informatik-Oberseminar): Statistical Models for Hierarchical Phrase-based Machine Translation

Wednesday, 1 August 2018, 10:00 a.m.

Location: Informatik E3, Room 9222

Speaker: Dipl.-Inform. Matthias Huck

Abstract:

Machine translation systems automatically translate texts from one natural language to another. For many years, the dominant approach to machine translation has been phrase-based statistical machine translation. In statistical machine translation, probabilistic models are learned from training data, and a decoder conducts a search to determine the best translation of an input sentence based on model scores. Phrase-based systems rely on elementary translation units that are contiguous bilingual sequences of words, called phrases. The hierarchical approach to statistical machine translation allows for phrases with gaps. Formally, the hierarchical phrase inventory can be represented as a synchronous context-free grammar that is induced from bilingual text, and hierarchical decoding can be carried out with a parsing-based procedure. The hierarchical phrase-based machine translation paradigm enables modeling of reorderings and long-distance dependencies in a consistent way. The typical statistical models that guide hierarchical search are fairly similar to those employed in conventional phrase-based translation. In this work, novel extensions with statistical models for hierarchical phrase-based machine translation are developed, with a focus on methods that do not require any syntactic annotation of the data. Specifically, enhancements of hierarchical systems with extended lexicon models that take global source sentence context into account are investigated; various lexical smoothing variants are examined; reordering extensions and a phrase orientation model for hierarchical translation are introduced; word insertion and deletion models are presented; techniques for training hierarchical translation systems with additional synthetic data are suggested; and a training method is proposed that utilizes additional synthetic data created via a pivot language.
The beneficial impact of the extensions on translation quality is verified by means of empirical evaluation on various language pairs, including Arabic-English, Chinese-English, French-German, English-French, and German-French.

Invitation extended by: The lecturers of the Department of Computer Science