Recently, maxout networks have brought significant improvements to various speech recognition and computer vision tasks. In this paper we introduce two new types of generalized maxout units, which we call p-norm and soft-maxout. We investigate their performance in Large Vocabulary Continuous Speech Recognition (LVCSR) tasks in various languages with 10 hours and 60 hours of data, and find that the p-norm generalization of maxout consistently performs well. Because, in our setup, training unbounded-output nonlinearities such as these is sometimes unstable, we also present a method to control that instability: a "normalization layer", a nonlinearity that scales down all dimensions of its input to stop the average squared output from exceeding one. The performance of the proposed nonlinearities is compared with maxout, rectified linear units (ReLU), tanh units, and also with a discriminatively trained SGMM/HMM system; our p-norm units with p equal to 2 are found to perform best.
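The two mechanisms described above, p-norm units and the normalization layer, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the group size of 5 and the use of NumPy are assumptions for the example.

```python
import numpy as np

def pnorm_units(x, group_size=5, p=2):
    """p-norm generalization of maxout: each output is the p-norm
    over a small group of linear activations. group_size=5 is an
    illustrative choice; p=2 is the setting the paper finds best."""
    groups = x.reshape(-1, group_size)
    return np.linalg.norm(groups, ord=p, axis=1)

def normalization_layer(y):
    """Scale down all dimensions of the input so that the average
    squared output does not exceed one; inputs already below that
    threshold pass through unchanged."""
    mean_sq = np.mean(y ** 2)
    if mean_sq > 1.0:
        y = y / np.sqrt(mean_sq)
    return y
```

Because the p-norm output is unbounded (unlike tanh), the normalization layer caps the activation scale, which is the stabilizing role it plays during training.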
Traditional i-vector speaker recognition systems use a Gaussian mixture model (GMM) to collect sufficient statistics (SS). Recently, replacing this GMM with a deep neural network (DNN) has shown promising results. In this paper, we explore the use of DNNs to collect SS for the unsupervised domain adaptation task of the Domain Adaptation Challenge (DAC). We show that collecting SS with a DNN trained on out-of-domain data boosts the speaker recognition performance of an out-of-domain system by more than 25%. Moreover, we integrate the DNN into an unsupervised adaptation framework that uses agglomerative hierarchical clustering with a stopping criterion based on unsupervised calibration, and show that the initial gains of the out-of-domain system carry over to the final adapted system. Although the DNN is trained on out-of-domain data, the final adapted system yields a relative improvement of more than 30% over the best published results on this task.
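The sufficient statistics in question are the zeroth- and first-order Baum-Welch statistics used by the i-vector extractor. A minimal sketch of their collection is below; the only change in the DNN-based setup is where the per-frame posteriors come from (DNN senone outputs instead of GMM component responsibilities). Shapes and names here are illustrative, not from the paper.

```python
import numpy as np

def collect_stats(features, posteriors):
    """Baum-Welch sufficient statistics for i-vector extraction.

    features:   (T, D) array of frame-level acoustic features
    posteriors: (T, C) array of per-frame posteriors over C classes,
                taken from a GMM in the traditional system or from a
                DNN's senone outputs in the DNN-based system.
    Returns zeroth-order stats N (C,) and first-order stats F (C, D).
    """
    N = posteriors.sum(axis=0)   # soft count per class
    F = posteriors.T @ features  # posterior-weighted feature sums
    return N, F
```

The rest of the i-vector pipeline is unchanged, which is why the DNN can be dropped in as a replacement for the GMM.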
This paper provides an overview of a speech-to-text (STT) and keyword search (KWS) system architecture built primarily on top of the Kaldi toolkit, and expands on a few highlights. The system was developed as part of the research efforts of the RADICAL team while participating in the IARPA Babel program. Our aim was to develop a general system pipeline that could be easily and rapidly deployed in any language, independent of the language's script and of its phonological and linguistic features.

Index Terms: Kaldi, spoken term detection, keyword search, speech recognition, deep neural networks, pitch, IARPA BABEL, OpenKWS

BACKGROUND

The IARPA BABEL program aims to achieve the capability to rapidly develop speech-to-text (STT) and keyword search (KWS) systems in new languages with limited linguistic resources (transcribed speech, a pronunciation lexicon and matched text), with emphasis on conversational speech. The four BABEL program participants were evaluated by NIST via two benchmark tests: on five development languages and on a surprise language revealed only at the beginning of the evaluation period. The development languages were Assamese, Bengali, Haitian Creole, Lao and Zulu, and the surprise language was Tamil. Eight additional teams worldwide participated in the surprise language evaluation.

The primary 2014 evaluation was on KWS performance using systems trained on an IARPA-provided limited language pack (LimitedLP) containing 10 hours of transcribed speech, a dictionary covering the words in the transcripts, 70 hours of un-transcribed speech for unsupervised training, and 10 hours of transcribed speech for development testing. A secondary evaluation was on KWS performance using a full language pack (FullLP), in which transcripts and dictionary entries were provided for an additional 50 of the 70 hours of un-transcribed speech: 60 hours transcribed in total.
The test data provided by NIST contained 15 hours of speech for each development language, 75 hours for the surprise language, and a list of approximately 3000 keywords for each language. The primary KWS evaluation metric was actual term weighted value (ATWV), and the BABEL program goal for 2014 was to attain an ATWV of 0.30 in the LimitedLP training condition on all six languages. This paper describes the system submitted to NIST by the JHU Kaldi team. It is expected to interest readers because the submitted system attained all the program goals, enabling the RADICAL team to achieve third place worldwide, and because 9 of the top 10 participants in the NIST evaluation used Kaldi components/recipes in their submitted systems.

JHU KALDI SYSTEMS OVERVIEW

The Kaldi KWS system comprises LVCSR-based lattice generation followed by OpenFST-based indexing and keyword search. LVCSR systems based on four different acoustic models are used to decode and index the speech:
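The ATWV metric mentioned above averages a term-weighted value over keywords. A minimal sketch is below, assuming the standard NIST convention of one false-alarm trial per second of speech and the usual beta of 999.9; the dictionary-based interface is an assumption for the example, not the NIST scoring tool's API.

```python
def atwv(hits, false_alarms, true_counts, speech_seconds, beta=999.9):
    """Actual Term Weighted Value for keyword search.

    hits[k]:         number of correctly detected occurrences of keyword k
    false_alarms[k]: number of false detections of keyword k
    true_counts[k]:  number of true occurrences of k in the reference
    speech_seconds:  total duration of the evaluated speech, in seconds

    For each keyword: P_miss = 1 - hits/true, and
    P_fa = false_alarms / (speech_seconds - true), i.e. one trial per
    second of speech. Keywords absent from the reference are skipped.
    """
    twv_sum, n = 0.0, 0
    for k, true in true_counts.items():
        if true == 0:
            continue
        p_miss = 1.0 - hits.get(k, 0) / true
        p_fa = false_alarms.get(k, 0) / (speech_seconds - true)
        twv_sum += 1.0 - p_miss - beta * p_fa
        n += 1
    return twv_sum / n if n else 0.0
```

The large beta makes each false alarm roughly a thousand times as costly as a miss on a keyword with a single true occurrence, which is why KWS systems tune their detection thresholds carefully.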