Monday, December 7, 2020, 4:00 p.m.

Discriminative Feature Modeling for Statistical Speech Recognition

  • Zoom:
  • Speaker: Zoltán Tüske, M.Sc.



Conventional speech recognition systems consist of feature extraction, acoustic modeling, language modeling, and search blocks. In a recent trend, deep neural networks have replaced or extended the traditional modeling approaches in these blocks. With such layered structures, data-driven feature extraction and representation learning happen at multiple levels, alongside traditional cepstral feature extraction. This work revisits and extends these manually and automatically derived features in multiple ways.
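To make the conventional front end concrete, the cepstral feature extraction mentioned above can be sketched as a minimal MFCC-style pipeline in plain NumPy. The window length, hop size, filter count, and the synthetic input below are illustrative choices, not the configuration used in the work.

```python
import numpy as np

def cepstral_features(signal, sr=16000, frame_len=400, hop=160,
                      n_filters=24, n_ceps=13):
    """MFCC-style cepstra: framing -> |FFT|^2 -> mel filterbank -> log -> DCT."""
    # Split the waveform into overlapping, Hamming-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop:i*hop+frame_len] * np.hamming(frame_len)
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank on the power spectrum.
    def mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((frame_len + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, power.shape[1]))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = np.linspace(0, 1, max(c - l, 1), endpoint=False)
        fbank[j, c:r] = np.linspace(1, 0, max(r - c, 1), endpoint=False)
    log_energies = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the log filterbank energies into cepstra.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return log_energies @ dct.T

# A synthetic voiced-like tone just to exercise the pipeline.
t = np.arange(16000) / 16000.0
ceps = cepstral_features(np.sin(2 * np.pi * 220 * t))
print(ceps.shape)
```

Each row of the result is a cepstral vector for one 25 ms frame; this fixed, manually designed pipeline is exactly what the later parts of the talk contrast with learned representations.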

In the first part, we relax the short-time stationarity assumption of traditional feature extraction and introduce a novel non-stationary framework for a more precise analysis of voiced speech. The noise robustness of the derived features is evaluated on standard noisy speech recognition tasks.

Second, with the advent of deep learning and big data, the necessity of manually designed feature extraction pipelines has been challenged, and we investigate whether direct acoustic modeling of the waveform is a viable option. The representation learned by a deep neural network is analyzed, and we study whether it is useful to choose neural network structures a priori based on decades of speech signal processing knowledge.
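The kind of first-layer analysis referred to here can be sketched as follows: a strided time convolution applied directly to the waveform, with each filter characterized by the peak of its magnitude response, which reveals whether training has discovered a filterbank-like frequency decomposition. The filter weights below are random stand-ins for trained parameters, and all layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16000
n_filters, filt_len = 32, 128

# First layer of a waveform-input acoustic model: a bank of 1-D convolutions.
# (Random weights here; in a trained model these are learned from data.)
filters = rng.standard_normal((n_filters, filt_len)) * 0.01

def conv_layer(wave, filters, stride=160):
    """Strided time convolution followed by a ReLU, as in waveform models."""
    out_len = (len(wave) - filters.shape[1]) // stride + 1
    out = np.stack([[wave[i*stride:i*stride+filters.shape[1]] @ f
                     for i in range(out_len)] for f in filters])
    return np.maximum(out, 0.0)

# Analysis step: the magnitude response of each filter tells us whether the
# layer behaves like a (possibly mel-like) filterbank after training.
spectra = np.abs(np.fft.rfft(filters, n=512, axis=1))
center_freqs = np.fft.rfftfreq(512, d=1.0 / sr)[spectra.argmax(axis=1)]

wave = rng.standard_normal(16000)
act = conv_layer(wave, filters)
print(act.shape, center_freqs.min(), center_freqs.max())
```

Sorting trained filters by `center_freqs` and comparing them against a hand-designed filterbank is one way to judge how much signal processing knowledge the network rediscovers on its own.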

Third, a theoretical connection is presented between the two most widely used and equally powerful types of neural network based acoustic models, the hybrid and tandem approaches. Supported by experimental results, we show that a Gaussian mixture model trained on optimally chosen neural network based features (the tandem approach) cannot perform worse than a comparable hybrid model.
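A toy version of the tandem construction described above can be sketched as follows: features produced by a neural network layer are modeled by one Gaussian per class, and classification uses the class-conditional log-likelihoods. The network weights below are random stand-ins (in the tandem approach they would be trained), and the data, dimensions, and single-component Gaussians are all illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, hid, n_class, n = 20, 10, 3, 300

# Synthetic "acoustic" data: one well-separated cluster per class.
means = rng.standard_normal((n_class, dim)) * 3.0
labels = rng.integers(0, n_class, size=n)
x = means[labels] + rng.standard_normal((n, dim))

# Tandem front end: a neural network layer as feature extractor.
# (Random weights here; trained bottleneck features in the real approach.)
W = rng.standard_normal((dim, hid))
feats = np.tanh(x @ W)

# Gaussian model per class on the NN features (diagonal covariance).
mu = np.stack([feats[labels == c].mean(axis=0) for c in range(n_class)])
var = np.stack([feats[labels == c].var(axis=0) + 1e-6 for c in range(n_class)])

# Classify by class-conditional log-likelihood.
ll = -0.5 * (((feats[:, None, :] - mu) ** 2) / var + np.log(var)).sum(axis=2)
pred = ll.argmax(axis=1)
print("train accuracy:", (pred == labels).mean())
```

The theoretical point of this part is that, with optimally chosen features, such a Gaussian model on NN outputs can always match the corresponding hybrid model's posteriors; the sketch only shows the structure of the tandem pipeline, not that argument.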

High-quality transcribed speech data is a significant cost factor in developing acoustic models for a new language. In the fourth part, an efficient multilingual neural network framework is presented that reuses resources collected in other languages to improve system performance. Further, a rapid, multilingual-feature-based framework is proposed, which allows us to reach reasonable performance under extremely short time constraints and very limited data conditions.
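One common way to realize such resource sharing, and a plausible reading of the framework described above, is a network whose hidden layers are shared across languages while each language keeps its own output layer. The sketch below shows only that structure; the layer sizes, vocabulary sizes, and random weights are illustrative assumptions, not the setup of the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, hid = 13, 32
vocab = {"lang_a": 40, "lang_b": 25}  # e.g. language-specific phone sets

# Multilingual network: hidden representation shared across languages,
# one softmax output layer per language. (All sizes are illustrative.)
W_shared = rng.standard_normal((dim, hid)) * 0.1
W_out = {l: rng.standard_normal((hid, n)) * 0.1 for l, n in vocab.items()}

def forward(x, lang):
    """Shared feature learning, language-specific classification."""
    h = np.tanh(x @ W_shared)          # reusable across languages
    logits = h @ W_out[lang]           # trained on per-language data
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = rng.standard_normal((5, dim))
p_a = forward(x, "lang_a")
p_b = forward(x, "lang_b")
print(p_a.shape, p_b.shape)
```

For a new language, only a fresh output layer must be trained on the limited in-language data, while the shared hidden layers (or their activations, used as multilingual features) transfer from the resource-rich languages.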

Last, we also investigate multi-domain neural network language model structures. The proposed framework allows efficient domain adaptation with limited data, and a shared embedding space for the language model history across domains results in a compact final model. Besides comparing the performance of neural network and traditional count-based models, we also examine the effective context length of the best-performing neural networks.
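The shared-history-embedding idea can be sketched with a small feedforward language model: word embeddings and the history encoder are shared across domains, and only a thin per-domain output layer is adapted on limited in-domain data. The domain names, vocabulary size, and all weights below are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, emb, hid, context = 50, 16, 24, 2
domains = ["news", "dialog"]

# Shared across domains: word embeddings and the history encoder.
E = rng.standard_normal((vocab, emb)) * 0.1
W_h = rng.standard_normal((context * emb, hid)) * 0.1
# Per-domain: only a small output layer, adapted on limited data.
W_d = {d: rng.standard_normal((hid, vocab)) * 0.1 for d in domains}

def next_word_probs(history, domain):
    """Feedforward LM: embed fixed-length history, encode, per-domain softmax."""
    h = np.tanh(np.concatenate([E[w] for w in history]) @ W_h)
    logits = h @ W_d[domain]
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = next_word_probs([3, 7], "news")
print(p.shape, round(p.sum(), 6))
```

Because the embeddings and encoder are stored once rather than per domain, the combined model stays compact as domains are added, which is the property the abstract highlights.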


You are cordially invited by the lecturers of the Computer Science department.