This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. The most important techniques in front-end processing, acoustic modelling and model training, and language and pronunciation modelling are discussed in detail. These include conversation-side-based cepstral normalisation, vocal tract length normalisation, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion-network-based decoding and confidence score estimation, pronunciation selection, language model interpolation and class-based language models. The transcription system developed for participation in the 2002 NIST Rich Transcription evaluations of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. The derivation of faster systems with moderate performance degradation is discussed in the context of the 2002 CU-HTK 10×RT conversational speech transcription system.