Alignment models for recurrent neural networks

Doetsch, Patrick; Ney, Hermann (Thesis advisor); Juan, Alfons (Thesis advisor)

Aachen (2020)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2020


Modern recognition systems for speech and handwriting make use of neural networks to convert the acoustic signal or handwritten image into text. Neural networks hereby learn the required parameters from transcribed data in a training phase. In the beginning only feed-forward neural networks were used, which had to be initialized with the alignment of observations and labels of a previously trained Gaussian hidden Markov model for good performance. More recently, recurrent neural network architectures have been shown to outperform their non-recurrent counterparts, with Long Short-Term Memories being the most prominent example. Recurrent neural networks can model the temporal nature of the data directly, and thus are able to dynamically change the alignment to better fit the model. In this thesis, we will investigate applications and training techniques of recurrent neural network architectures for speech and handwriting recognition. As part of this thesis we developed a neural network toolkit for hardware accelerated training and recognition of speech and handwriting systems. The software allows to train recurrent neural network architectures as well as traditional feed-forward neural networks and is capable of processing very large amounts of data on multiple computing devices. After training, he models can be loaded into the the RWTH Aachen speech recognition toolkit for recognition. Our experiments show that recurrent models outperform feed-forward structures in terms of recognition error and we demonstrate their effectiveness in various experiments on handwriting recognition. Further contributions were made by developing techniques to improve the training performance through optimized data ordering. With our toolkit we then evaluate neural network based methods for handwriting recognition. Our focus is hereby on recurrent topologies that operate on images either in a one-dimensional or two-dimensional fashion, and we investigate various system architectures and implementation techniques. We examine the effectiveness of our proposed solutions on prominent handwriting recognition corpora and compare our systems to other groups in a competitive setting. In the final part of this thesis we investigate the effects of handling the alignment problem within recurrent neural networks. We describe overfitting problems of conventional alignment approaches and study properties of the connectionist temporal classification error criterion. Furthermore, we investigate methods that do not make use of external alignment computations, and instead only rely on a special composition of two recurrent neural networks that is able transcribe input observations into output symbols directly. Motivated by these results, we develop direct hidden Markov models as a novel inverted alignment method, which is able to overcome some of the limitations we noticed, and we evaluate our method on speech and handwriting recognition tasks.