Neural sequence-to-sequence modeling for language and speech translation

Bahar, Parnia; Ney, Hermann (Thesis advisor); Yvon, Francois (Thesis advisor); Decker, Stefan Josef (Thesis advisor)

Aachen : RWTH Aachen University (2022, 2023)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2022


In recent years, various fields in human language technology have been advanced by the success of neural sequence-to-sequence modeling. The application of attention models to automatic speech recognition and to text and speech machine translation has become dominant and well established. Although the effectiveness of such models has been documented in scientific papers, not all aspects of attention sequence-to-sequence models have been explored, and some essential concepts are still missing. Therefore, the main contribution of this thesis centers on redesigning attention models by proposing novel alternative models, in terms of architecture and mathematical formulation, for language technology. As attention models do not make any conditional dependence assumption on previous attention information, this work, inspired by statistical word alignments, first extends recurrent attention models by implicitly including more alignment information from previous output positions. Furthermore, from a modeling perspective, this research goes beyond current sequence-to-sequence backbone models to directly incorporate input and output sequences in a 2D structure where an attention mechanism is no longer required. This model distinguishes itself from attention models, in which inputs and outputs are treated as one-dimensional sequences over time and then combined with an attention mechanism. In contrast to attention models, which do not reinterpret encoder states while decoding, the proposed model enhances the degree of variance in context vectors by refining input representations to be sensitive to the partial translation. Current state-of-the-art attention models also lack an explicit alignment, a core component of traditional systems. Instead, their attention mechanism may be considered to produce an implicit alignment. Such a gross simplification of a complex process complicates the extraction of alignments between input and output positions.
To make attention models explainable and their output easier to control, the next part of this study integrates the attention model into the hidden Markov model formulation by introducing alignments as a sequence of hidden variables. Since the number of terms in the marginalization grows exponentially with the alignment dependency order of the model, a zero-order assumption that is simpler and more efficient is explored. Finally, an exciting research direction is to combine speech recognition with text machine translation for speech-to-text translation. Besides advancing a cascade of independently trained speech recognition and machine translation systems, this thesis sheds light on multiple end-to-end models that directly translate speech inputs to target texts. In this context, promising methods are borrowed from speech recognition, and best practices are established for direct modeling. Addressing and revisiting methods already proposed in the literature, the last part of this study investigates and develops new approaches to leverage all types of available training data, i.e., speech-to-source, source-to-target, and speech-to-target text data. Ultimately, it is shown that end-to-end models can, in practice, translate speech utterances as a substitute for cascaded speech translation.


  • Department of Computer Science [120000]
  • Chair of Computer Science 6 (Machine Learning and Reasoning) [122010]