Discriminative feature modeling for statistical speech recognition

Tüske, Zoltán; Ney, Hermann (Thesis advisor); Heřmanský, Hynek (Thesis advisor)

Aachen : RWTH Aachen University (2020, 2021)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2020


Conventional speech recognition systems consist of feature extraction, acoustic and language modeling, and search blocks. In a recent trend, the traditional modeling approaches in these blocks have been replaced or extended with neural networks. Due to the layered structure of such models, data-driven feature extraction and representation learning happen at multiple levels in modern automatic speech recognition (ASR), in addition to the traditional cepstral feature extraction. This work revisits and extends these manually and automatically derived features in multiple ways.

Acoustic models are traditionally trained on cepstral features. However, the underlying signal analysis assumes that speech is short-time stationary. Several acoustic phenomena challenge this assumption; therefore, in the first part of the thesis we relax it and introduce a novel non-stationary framework for analyzing voiced speech. We derive noise-robust features from this more precise analysis and evaluate them extensively on noisy speech recognition tasks.

Conventional acoustic models are trained on the output of hand-crafted feature extraction pipelines. With the advent of deep learning and big data, it is reasonable to question the necessity of such pipelines. Therefore, we investigate whether direct acoustic modeling of the waveform is a viable option. We analyze what representation a deep neural network derives from the highly complex one-dimensional acoustic signal. We also study whether it is useful to choose neural network structures a priori, based on decades of speech signal processing knowledge.

The two most widely used types of neural network based acoustic models are the hybrid and tandem approaches. The hybrid approach models Markov states directly and has become the de facto standard. Nevertheless, the tandem approach, which uses a neural network as a feature extractor for a Gaussian mixture model (GMM), is equally powerful.
Applying recent neural network based structures, we propose several modifications to improve the tandem modeling approach. Moreover, we present a theoretical connection between hybrid and tandem models, which shows that an optimal tandem model cannot perform worse than a similar hybrid model.

High-quality transcribed speech data is a significant cost factor in developing acoustic models for a new language. We present an efficient multilingual neural network framework to reuse resources collected in other languages. By simultaneously training the network on multiple languages, we force the model to extract a language-independent representation. Speech recognition and keyword search experiments show that such a representation serves as an excellent initialization and can significantly improve performance for both under-resourced and high-resourced languages. We also propose a multilingual feature based rapid development framework, which allows us to reach reasonable performance under extremely short time constraints and very limited data conditions.

Motivated by the success of multilingual features, we also investigate multi-domain neural network language models. We show that, by using a shared, data-driven feature space to represent the language model history, a log-linear interpolation framework allows efficient limited-data domain adaptation and also results in a compact final model. The framework is evaluated on several speech recognition tasks, using billion-word corpora and both feed-forward and recurrent long short-term memory (LSTM) language models. By applying recently proposed techniques to improve LSTM models, we also examine the effective context length of the best-performing models.