Investigations on neural networks, discriminative training criteria and error bounds

Nußbaum-Thom, Markus; Ney, Hermann (Thesis advisor); Häb-Umbach, Reinhold (Thesis advisor)

Aachen (2020, 2021)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2020


The task of an automatic speech recognition system is to convert speech signals into written text by choosing the recognition result according to a statistical decision rule. Discriminative training of the underlying statistical model is essential to improving the word error rate of the system. In automatic speech recognition, a mismatch exists between the loss used in the word error rate performance measure, the loss of the decision rule, and the loss of the discriminative training criterion. In the course of this thesis, the analysis of this mismatch leads to the development of novel error bounds and training criteria, which are evaluated in practical speech recognition experiments. In summary, we come to the conclusion that the statistical model is able to compensate for this mismatch if the discriminative training criterion involves the loss of the performance measure.

Automatic speech recognition is based on Bayes decision rule, which chooses the most probable sentence as the recognition result for a given speech signal. The performance of the recognition result is measured by the word error rate. This measure is based on the Levenshtein loss, which counts the minimum number of insertions, deletions, and substitutions needed to transform the spoken sentence into the recognized one. However, this choice of performance measure bears a fundamental mismatch to the loss targeted by the maximum probability decision rule: by definition, Bayes decision rule minimizes the sentence error rate, which does not guarantee optimizing the performance measure of automatic speech recognition, the word error rate. The straightforward approach to overcome this problem incorporates the Levenshtein loss into Bayes decision rule by choosing as the recognition result the sentence that minimizes the posterior-expected Levenshtein loss. Nevertheless, the evaluation of this decision rule is too time- and memory-consuming.
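The difference between the two decision rules can be sketched with a small numerical example. The candidate sentences and posterior probabilities below are invented purely for illustration; the sketch only shows that the maximum probability rule and the minimum-expected-Levenshtein-loss rule can pick different hypotheses:

```python
# Sketch: maximum-probability vs. posterior-expected-Levenshtein-loss decision.
# Candidate sentences and posterior values are invented for illustration.

def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions (word level)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]

# Toy posterior over candidate sentences for one speech signal.
posterior = {"the cat sat": 0.4, "a cat sat": 0.3, "a cat sit": 0.3}

# Bayes (maximum probability) decision rule: highest posterior wins.
map_hyp = max(posterior, key=posterior.get)

# Decision rule minimizing the posterior-expected Levenshtein loss.
def expected_loss(hyp):
    return sum(p * levenshtein(ref.split(), hyp.split())
               for ref, p in posterior.items())

mbr_hyp = min(posterior, key=expected_loss)

print(map_hyp)  # "the cat sat" (posterior 0.4)
print(mbr_hyp)  # "a cat sat"  (expected loss 0.7 vs. 0.9)
```

Here the maximum probability rule selects the single most probable sentence, while the expected-loss rule prefers a hypothesis that is close, in the Levenshtein sense, to all probable sentences at once.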
In practice, this minimum-expected-loss rule is only performed as a post-processing step after the search of the maximum probability decision rule.

Furthermore, in practice we have to make a model assumption in Bayes decision theory. The theory assumes the true distribution, which is the empirical prior of all speech signals and spoken sentences; this distribution is unknown in practice. To stay as close as possible to the principle of Bayes decision rule, the true distribution is substituted by a model distribution with free parameters. The corresponding maximum probability decision rule using this model is called the model-based decision rule. The free parameters of the model are learned from training data, e.g., with generative training; subsequently, discriminative training fine-tunes the model. For automatic speech recognition, the type of discriminative training criterion plays a crucial role. For example, the Minimum Phone Error (MPE) criterion, which involves the Levenshtein loss, performs better than other discriminative criteria like cross-entropy or maximum mutual information. Despite its superior practical performance, however, the MPE criterion lacks theoretical justification. In contrast, the cross-entropy criterion can be obtained from a formal derivation scheme based on the Kullback-Leibler divergence comparing the true and the model distribution. In this scheme, the Kullback-Leibler divergence is an upper bound on the error difference between the model-based and Bayes decision rule, which measures the performance difference between the two rules. For the MPE criterion, unlike the cross-entropy criterion, no such derivation scheme relating the training criterion to an upper bound on the error difference exists. In this thesis, we close this gap and give a theoretical justification for the MPE criterion.
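The link between the Kullback-Leibler divergence and the cross-entropy criterion rests on a standard decomposition: the divergence between the true distribution p and the model q splits into the cross-entropy minus the entropy of p, and the entropy term does not depend on the model parameters. A minimal numerical sketch, with toy distributions invented for illustration:

```python
import math

# Toy "true" distribution p and model distribution q over three classes.
# The numbers are invented for illustration.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.6, "b": 0.2, "c": 0.2}

# Kullback-Leibler divergence D(p || q).
kl = sum(p[k] * math.log(p[k] / q[k]) for k in p)

# Cross-entropy H(p, q) and entropy H(p).
cross_entropy = -sum(p[k] * math.log(q[k]) for k in p)
entropy = -sum(p[k] * math.log(p[k]) for k in p)

# D(p || q) = H(p, q) - H(p): since H(p) is constant in the model,
# minimizing the cross-entropy in q also minimizes the divergence.
assert abs(kl - (cross_entropy - entropy)) < 1e-12
```

In training, p is replaced by the empirical distribution of the labeled data, so minimizing the empirical cross-entropy of the model posteriors drives the model toward the true distribution in the Kullback-Leibler sense.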
In the first part of this thesis, we develop a scheme to derive discriminative training criteria from bounds on the error difference between the model-based and Bayes decision rule. The f-divergence is the basis for the examined error bounds. This divergence family is a generalization of the Kullback-Leibler divergence and is used to compare two distributions. We start by formulating proofs that derive upper f-divergence bounds on the classification error difference. These proofs are then extended to error bounds based on more general losses, including error bounds based on the Levenshtein loss, which are relevant to the mismatch between the performance measure and the model-based decision rule in automatic speech recognition. We ultimately find a type of explicit bound that is suitable for deriving discriminative training criteria. Before this thesis, no derivation scheme for more general losses like the Levenshtein loss existed that relates the training criterion to an upper bound on the error difference. Our novel training criteria are evaluated in practical automatic speech recognition experiments. These experiments include frame-wise training of neural networks as well as sequence training of log-linear mixture models. We show that our novel f-divergence training criteria achieve competitive or better performance than the conventional cross-entropy and minimum phone error criteria.

The second part of this thesis summarizes our successful participation in the QUAERO project evaluation campaign. We contributed the automatic speech recognition system for German in all project periods, achieving the best or competitive results.
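For discrete distributions, the f-divergence family can be written as D_f(p || q) = sum over x of q(x) f(p(x)/q(x)) for a convex f with f(1) = 0; the Kullback-Leibler divergence is the member f(t) = t log t, and the total variation distance is f(t) = |t - 1| / 2. A minimal sketch of this generalization, with toy distributions invented for illustration:

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)), for convex f with f(1) = 0."""
    return sum(q[k] * f(p[k] / q[k]) for k in p)

# Toy distributions, invented for illustration.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.6, "b": 0.2, "c": 0.2}

# Kullback-Leibler divergence: the member with f(t) = t * log(t).
kl = f_divergence(p, q, lambda t: t * math.log(t))

# Total variation distance: the member with f(t) = |t - 1| / 2.
tv = f_divergence(p, q, lambda t: abs(t - 1) / 2)

# Cross-checks against the direct definitions of both divergences.
assert abs(kl - sum(p[k] * math.log(p[k] / q[k]) for k in p)) < 1e-12
assert abs(tv - 0.5 * sum(abs(p[k] - q[k]) for k in p)) < 1e-12
```

Different choices of f thus yield different members of the family, which is what allows the error bounds examined in the thesis to be stated for the whole family at once rather than for the Kullback-Leibler divergence alone.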