Tuesday, 5 May 2020, 14:00

Advancing Neural Language Modeling in Automatic Speech Recognition



Statistical language modeling is one of the fundamental problems in natural language processing. In recent years, language modeling has seen great advances through active research and engineering efforts in applying artificial neural networks, especially recurrent ones. The application of neural language models to speech recognition is now well established and ubiquitous. Despite this impression of maturity, we claim that the full potential of neural network based language modeling is yet to be explored. In this thesis, we further advance neural language modeling in automatic speech recognition by investigating a number of new perspectives. From the architectural viewpoint, we investigate the recently proposed Transformer neural networks for language modeling. The original model architecture, proposed for machine translation, is studied and modified to accommodate the specific task of language modeling. Particularly deep models with about one hundred layers are developed. We present an in-depth comparison with state-of-the-art recurrent neural network language models based on long short-term memory.

When language modeling is scaled up to larger datasets, the diversity of the data emerges as both an opportunity and a challenge. Current state-of-the-art neural language modeling lacks a mechanism for handling diverse data from different domains so that a single model performs well across all of them. In this context, we introduce domain robust language modeling with neural networks and propose two solutions. As a first solution, we propose a new type of adaptive mixture of experts model which is fully based on neural networks. In the second approach, we investigate knowledge distillation from multiple domain expert models as a solution to the large model size problem seen in the first approach. Methods for practical application of knowledge distillation to large vocabulary language modeling are proposed and studied extensively.

Finally, we investigate the potential of neural language models to leverage long-span cross-sentence contexts for cross-utterance speech recognition. The appropriate training method for such a scenario is under-explored in existing work. We carry out systematic comparisons of training methods, which allow us to achieve improvements in cross-utterance speech recognition. In the same context, we study sequence length robustness for both recurrent neural networks based on long short-term memory and Transformers, because such robustness is one of the fundamental properties we wish to have in neural networks that handle variable length contexts. Throughout the thesis, we tackle these problems through novel perspectives on neural language modeling, while keeping the traditional spirit of language modeling in speech recognition.


You are invited by: the lecturers in Computer Science