Minimum Phone Error Training of Acoustic Models and Features for Large Vocabulary Mandarin Speech Recognition
Date Issued
2006
Author(s)
Chen, Jia-Yu
Abstract
Traditional speech recognition trains the parameters of hidden Markov models (HMMs) by maximum likelihood estimation. This approach aims to give the correct transcript the largest posterior probability, but it cannot effectively separate easily confused models. Discriminative training considers the correct transcript and the recognized results at the same time, trying to separate confused models in a high-dimensional space.
Based on minimum phone error (MPE) and feature-space minimum phone error (fMPE), this thesis introduces the background, basic theory, and experimental results of discriminative training. The thesis has four parts:
The first part covers the basic theory, including risk estimation and auxiliary functions. Risk estimation starts from minimum Bayes risk and introduces widely explored training criteria, including maximum likelihood estimation, maximum mutual information estimation, overall risk criterion estimation, and minimum phone error; these objective functions can all be regarded as extensions of the Bayes risk. In addition, the thesis reviews strong-sense and weak-sense auxiliary functions and the smoothing function. Strong-sense and weak-sense auxiliary functions are used to find the optimal solution; when optimizing with a weak-sense auxiliary function, adding a smoothing function improves the convergence speed.
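As a concrete instance of these criteria, the MPE objective in its usual lattice-based form (a standard formulation; the probability scale \(\kappa\) is not specified in the abstract) is the expected phone accuracy over all hypotheses:

```latex
F_{\mathrm{MPE}}(\lambda)
  = \sum_{r=1}^{R}
    \frac{\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\, P(s)\, A(s, s_r)}
         {\sum_{s'} p_{\lambda}(O_r \mid s')^{\kappa}\, P(s')}
```

Here \(O_r\) is the \(r\)-th utterance, \(s\) ranges over hypothesized sentences, \(s_r\) is the reference transcript, and \(A(s, s_r)\) is the raw phone accuracy of \(s\) against \(s_r\). Maximizing \(F_{\mathrm{MPE}}\) maximizes the posterior-weighted average phone accuracy; other choices of loss in the same Bayes-risk framework recover criteria such as maximum likelihood and maximum mutual information.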
The second part describes the experimental setup, including the NTNU broadcast news corpus, the lexicon, and the language model. The recognizer implements large vocabulary continuous speech recognition (LVCSR) with a left-to-right, frame-synchronous tree-copy search. Maximum likelihood models trained on Mel-frequency cepstral coefficients, and on features processed by heteroscedastic linear discriminant analysis, serve as the baselines.
The third part presents minimum phone error, which uses the phone error directly as the objective function. The update equations show that the newly estimated model parameters move closer to correctly recognized features (those in the numerator lattices) and away from incorrectly recognized features (those in the denominator lattices). The I-smoothing technique introduces a prior over the model parameters to improve the estimates. The thesis also explains how the phone error is approximated: how a lattice approximates the full set of recognition results, and how the forward-backward algorithm computes the average phone accuracy. Experiments show that this method reduces the character error rate by 3% on the corpus.
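A minimal sketch of the forward-backward computation of average accuracy on a toy lattice (the lattice topology, likelihoods, and per-arc phone accuracies below are invented for illustration; real systems work with log-scaled acoustic and language model scores):

```python
# Toy lattice over 3 nodes; each arc is (src, dst, likelihood, phone_accuracy).
# Arcs are listed in topological order of their source nodes.
arcs = [
    (0, 1, 0.6, 1.0),  # hypothesized phone matches the reference
    (0, 1, 0.4, 0.0),  # competing (incorrect) phone
    (1, 2, 0.7, 1.0),
    (1, 2, 0.3, 0.0),
]
n_nodes = 3

def lattice_forward_backward(arcs, n_nodes):
    """Return per-arc occupancy (posterior) probabilities via forward-backward."""
    alpha = [0.0] * n_nodes            # forward scores
    alpha[0] = 1.0
    for src, dst, lik, _ in arcs:
        alpha[dst] += alpha[src] * lik
    beta = [0.0] * n_nodes             # backward scores
    beta[n_nodes - 1] = 1.0
    for src, dst, lik, _ in reversed(arcs):
        beta[src] += lik * beta[dst]
    total = alpha[n_nodes - 1]         # sum over all lattice paths
    return [alpha[src] * lik * beta[dst] / total for src, dst, lik, _ in arcs]

gamma = lattice_forward_backward(arcs, n_nodes)
# Average (expected) phone accuracy: occupancy-weighted sum of arc accuracies.
c_avg = sum(g * arc[3] for g, arc in zip(gamma, arcs))
print(round(c_avg, 4))  # 1.3: on average 1.3 of the 2 phones are correct
```

MPE then compares each arc's own accuracy against this average to decide whether the arc contributes to the numerator (better than average) or denominator (worse than average) statistics.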
The fourth part presents feature-space minimum phone error, which projects each feature into a high-dimensional space and generates an offset vector that is added to the original feature, making the features more discriminative. The projection matrix is trained under the minimum phone error criterion and updated by gradient descent. The gradient has a direct differential and an indirect differential; the indirect differential propagates model changes back onto the features, so that feature training and model training can be carried out iteratively.
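A sketch of the fMPE feature computation, assuming the common setup in which the high-dimensional vector holds Gaussian posteriors of the current frame (the dimensions, Gaussian pool, and projection matrix `M` below are toy stand-ins; in fMPE, `M` is what gradient descent on the MPE objective trains):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_gauss = 13, 50                     # feature dim and Gaussian-pool size (toy values)
x_t = rng.normal(size=d)                # original cepstral feature at time t

# High-dimensional vector h_t: posteriors of x_t under a pool of Gaussians
# (unit-variance Gaussians here, for simplicity).
means = rng.normal(size=(n_gauss, d))
log_post = -0.5 * ((x_t - means) ** 2).sum(axis=1)
h_t = np.exp(log_post - log_post.max())
h_t /= h_t.sum()                        # normalize to a posterior distribution

# Projection matrix M; in fMPE it is trained by gradient descent on the MPE
# objective, but here it is random just to show the shape of the computation.
M = rng.normal(scale=0.01, size=(d, n_gauss))

y_t = x_t + M @ h_t                     # discriminative feature = original + offset
```

Because `M` starts near zero, training begins from features close to the originals and gradually learns a discriminative offset.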
Offset feature-space minimum phone error differs in how the high-dimensional feature is constructed; it saves about one quarter of the computation while achieving similar improvement. The thesis further proposes dimension-weighted offset feature-space minimum phone error, which assigns different weights to different feature dimensions. Experiments show that these methods yield a 3% character error rate reduction, and the dimension-weighted variant gives larger improvements and is more robust during training.
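The abstract does not give the exact form of the dimension weighting; one plausible reading is a learnable scalar per feature dimension scaling the offset before it is added. In this sketch the offset `o_t` and weights `w` are hypothetical random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 13
x_t = rng.normal(size=d)                   # original feature
o_t = rng.normal(scale=0.05, size=d)       # offset from the offset-fMPE transform (stand-in)

# Hypothetical per-dimension weights: dimensions that benefit more from the
# offset would learn larger weights during MPE training.
w = rng.uniform(0.5, 1.5, size=d)

y_t = x_t + w * o_t                        # dimension-weighted offset feature
```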
Subjects
minimum phone error (最小音素錯誤)
Type
thesis
File(s)
Name
ntu-95-R93942027-1.pdf
Size
23.31 KB
Format
Adobe PDF
Checksum
(MD5):4f89976ed49a031f2ed53f7fed645511
