Chapter 1 Introduction 1
  1.1 Problems with Very Large Codebook Discrete Systems 3
  1.2 Contributions of the Thesis 4
  1.3 Thesis Outline 6

Chapter 2 Introduction to Automatic Speech Recognition (ASR) 8
  2.1 Definition of ASR 8
    2.1.1 Statistical ASR 9
    2.1.2 ASR System Performance Evaluation Criterion
  2.2 Hidden Markov Model (HMM) for ASR
    2.2.1 Dynamic Features for HMM
    2.2.2 Training of and Recognition with HMM in ASR
  2.3 Different Types of HMM in ASR
    2.3.1 Continuous Density HMM (CDHMM)
    2.3.2 Semi-continuous HMM (SCHMM)
    2.3.3 Discrete HMM (DHMM)
    2.3.4 Multiple-stream HMM

Chapter 3 Related Work in DHMM
  3.1 Different Types of Codebooks
    3.1.1 Codebook by Unsupervised Construction
    3.1.2 Codebook by Supervised Construction

LIST OF FIGURES

2.1 Phone HMM /i/: first-order 3-state left-to-right topology
2.2 Word HMM "it": constructed from phone HMMs /i/ and /t/
2.3 Different types of HMM
2.4 Examples of 2-stream HMM systems
3.1 Codebooks produced by unsupervised and supervised construction. Each method partitions the acoustic space into 2 parts. The dots represent training samples in codebook construction, with colors representing different classes.
4.1 SVQ codebooks, where x_t is a d-dimensional feature vector at frame t, and VQ(x_t) is its corresponding full-space VQ codeword constructed from SVQ or SQ codewords. Note that when L = d, they become SQ codebooks.
4.2 An example of a 2-stream system with SVQ codebooks, where each stream is further split into 2 subvectors with 2 SVQ codebooks.
4.3 Relationship between model size and recognition error rate for HDDHMM with SQ codebooks
4.4 One-stream SHDDHMM architecture overview. The shaded ellipse represents the global pool of bases spanning the subspace. Each state output discrete distribution table (i.e., b_i and b_j) lies in the subspace. The state-dependent weights (the dotted lines) and the global pool of bases (the shaded ellipse) are temporary parameters that exist only during model training; the final model stores only b_i and b_j.
4.5 An example of a 2-stream SHDDHMM, where each stream is further split into 2 subvectors. A global pool of bases is stored independently for each stream, and the number of bases in a pool is 3.
4.6 Stream weight estimation results for SHDDHMM with iterative linear programming
4.7 Operating characteristics of various SI-84 models (codeword-finding time is included for discrete models)
4.8 An example of the "smoothing by adding 1" technique
5.1 An example of computer access control using speaker verification technology
5.2 GMM-UBM based speaker verification system
5.3 SV operating characteristics of various models, when the UBM and speaker models are pre-loaded into memory
5.4 SV operating characteristics of various models, when the UBM and speaker models are loaded on-the-fly

LIST OF TABLES

4.1 Experimental settings for different ASR systems 29
4.2 Bit allocation for each stream for HDDHMM in [53] 38
4.3 Baseline model performance on WSJ SI-84 42
4.4 Effect of codebook size for 4s-HDDHMM with SQ codebooks 44
4.5 Comparison of different inte...