Tuesday, 7 June 2022, 2:00 p.m.

Neural Network based Modeling and Architectures for Automatic Speech Recognition and Machine Translation



Our work aims to advance the field and application of neural networks, to extend existing sequence-to-sequence architectures and develop new ones, and to improve training methods. We perform a comprehensive study of long short-term memory (LSTM) acoustic models and improve over our feed-forward neural network (FFNN) baseline by 16% relative. Layer-normalized (LN) LSTM variants improve on this by up to a further 10% relative, with better training stability and convergence. Our comparison of Transformer and LSTM models yields state-of-the-art Transformer language models with a 6% relative improvement over the best LSTM language model. We aim to go beyond the status quo, the hybrid neural network-hidden Markov model (NN-HMM), by investigating alternative sequence-to-sequence architectures. We develop state-of-the-art attention-based models for machine translation and speech recognition. To introduce monotonicity and enable streaming, we propose latent local attention segmental models, which include hard attention as a special case. We establish the equivalence of segmental and transducer models and propose a novel class of generalized and extended transducer models, which perform and generalize better than our attention models.
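The layer-normalized LSTM mentioned above can be illustrated with a single time step. The following is a minimal plain-Python sketch of one common variant, which normalizes the input and recurrent projections separately and normalizes the cell state before the output tanh. The abstract refers to several LN variants; this sketch does not claim to reproduce any specific one of them, and all names (`ln_lstm_step`, `init_params`, the parameter dictionary keys) are hypothetical.

```python
import math
import random


def layer_norm(v, gain, bias, eps=1e-5):
    """Normalize v to zero mean and unit variance, then apply gain and bias."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [g * (a - mean) / math.sqrt(var + eps) + b
            for a, g, b in zip(v, gain, bias)]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ln_lstm_step(x, h, c, p):
    """One step of an LN-LSTM (hypothetical variant: LN on W_x x, on W_h h,
    and on the cell state before the output tanh)."""
    n = len(h)
    zx = layer_norm(matvec(p["W_x"], x), *p["ln_x"])
    zh = layer_norm(matvec(p["W_h"], h), *p["ln_h"])
    z = [a + b + bias for a, b, bias in zip(zx, zh, p["b"])]
    i = [sigmoid(v) for v in z[0:n]]           # input gate
    f = [sigmoid(v) for v in z[n:2 * n]]       # forget gate
    o = [sigmoid(v) for v in z[2 * n:3 * n]]   # output gate
    g = [math.tanh(v) for v in z[3 * n:]]      # candidate cell update
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    c_norm = layer_norm(c_new, *p["ln_c"])     # normalize cell before tanh
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_norm)]
    return h_new, c_new


def init_params(d, n, seed=0):
    """Random weights; LN gains start at 1 and LN biases at 0 (assumption)."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_x": rand(4 * n, d),
        "W_h": rand(4 * n, n),
        "b": [0.0] * (4 * n),
        "ln_x": ([1.0] * (4 * n), [0.0] * (4 * n)),
        "ln_h": ([1.0] * (4 * n), [0.0] * (4 * n)),
        "ln_c": ([1.0] * n, [0.0] * n),
    }


# Usage: run two time steps with input size 3 and hidden size 4.
params = init_params(d=3, n=4)
h, c = [0.0] * 4, [0.0] * 4
for x in ([1.0, 0.0, -1.0], [0.5, -0.5, 2.0]):
    h, c = ln_lstm_step(x, h, c, params)
print(h)
```

Because each projection is normalized, scaling `W_x` by a positive constant leaves the step's output essentially unchanged (up to `eps`). This is one plausible intuition for the improved training stability, not a claim about the analysis in the thesis.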

Our work shows that training strategies such as learning rate scheduling, data augmentation, and regularization play the most important role in achieving good performance. Our novel pretraining schemes, in which the depth and width of the neural network are grown during training, improve both convergence and performance. We study a generalized training procedure for hybrid NN-HMMs that includes the full sum over all alignments, and identify connectionist temporal classification (CTC) as a special case of it. Our novel mathematical analysis explains the peaky behavior of CTC and its convergence properties.
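To make the "full sum over all alignments" concrete for the CTC special case, here is a minimal sketch of the standard CTC forward recursion in plain Python. It is not RETURNN's implementation. It assumes per-frame posteriors given as a T x V list of probabilities and blank index 0 (both assumptions for illustration), and it works in probability space, so a practical version would use log-space to avoid underflow on long utterances.

```python
def ctc_full_sum(probs, labels, blank=0):
    """Sum of p(alignment) over all CTC alignments of `labels`.

    probs:  T x V per-frame label posteriors (hypothetical input format)
    labels: target label sequence without blanks
    """
    # CTC state sequence: blank, l1, blank, l2, ..., lN, blank
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    S = len(ext)

    # alpha[s]: summed probability of all partial alignments ending in state s
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * S
        for s in range(S):
            total = alpha[s]                  # stay in the same state
            if s >= 1:
                total += alpha[s - 1]         # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]         # skip a blank between different labels
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or in the trailing blank
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


# Usage: three frames over the vocabulary {blank=0, "a"=1, "b"=2}
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.5, 0.3],
         [0.3, 0.1, 0.6]]
print(ctc_full_sum(probs, [1, 2]))
```

For this example, the five alignments of "a b" (`a a b`, `a b b`, `a _ b`, `_ a b`, `a b _`) sum to approximately 0.387, and CTC training minimizes the negative log of this quantity. Only the state topology (blank-separated labels and the skip rule) is CTC-specific here.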

We develop large parts of RETURNN, an efficient and flexible software framework including beam search, which we use to perform all experiments. This framework, as well as most of our results and baselines, is widely used within our team and beyond. All of our work is published, and all code and setups are available online.


Invited by: the lecturers of Computer Science