In this paper we describe the major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance are also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented.
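The core of the system described above is a log-likelihood-ratio score between a speaker model and the UBM, where the speaker model is derived from the UBM by MAP (Bayesian) adaptation of the mixture means. The sketch below illustrates those two steps using scikit-learn; the relevance factor, mixture count, feature dimension, and random stand-in features are placeholder assumptions for illustration, not the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, features, relevance_factor=16.0):
    """MAP-adapt UBM means toward a speaker's features (means-only adaptation)."""
    post = ubm.predict_proba(features)             # (T, M) responsibilities
    n = post.sum(axis=0)                           # soft count per mixture
    ex = (post.T @ features) / np.maximum(n, 1e-10)[:, None]  # first-order stats
    alpha = n / (n + relevance_factor)             # data-dependent adaptation weight
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type='diag')
    spk.weights_ = ubm.weights_                    # weights and covariances stay
    spk.covariances_ = ubm.covariances_            # coupled to the UBM
    spk.precisions_cholesky_ = ubm.precisions_cholesky_
    spk.means_ = alpha[:, None] * ex + (1 - alpha)[:, None] * ubm.means_
    return spk

def llr_score(spk, ubm, test_features):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    return spk.score(test_features) - ubm.score(test_features)

# Toy usage with random frames standing in for real cepstral features
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
ubm.fit(rng.normal(size=(2000, 12)))
spk = map_adapt_means(ubm, rng.normal(loc=0.3, size=(300, 12)))
print(llr_score(spk, ubm, rng.normal(loc=0.3, size=(200, 12))))
```

Coupling the speaker model to the UBM in this way is what makes the fast scoring and robust adaptation of the GMM-UBM approach possible, since only the means move with the enrollment data.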
A sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves. These parameters are estimated from the short-time Fourier transform using a simple peak-picking algorithm. Rapid changes in the highly resolved spectral components are tracked using the concept of "birth" and "death" of the underlying sine waves. For a given frequency track, a cubic function is used to unwrap and interpolate the phase such that the phase track is maximally smooth. This phase function is applied to a sine-wave generator, which is amplitude modulated and added to the other sine waves to give the final speech output. The resulting synthetic waveform preserves the general waveform shape and is essentially perceptually indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech as well as the noise are maintained. In addition, it was found that the representation was sufficiently general that high-quality reproduction was obtained for a larger class of inputs including: two overlapping, superposed speech waveforms; music waveforms; speech in musical backgrounds; and certain marine biologic sounds. Finally, the analysis/synthesis system forms the basis for new approaches to the problems of speech transformations including time-scale and pitch-scale modification, and mid-rate speech coding [8], [9].
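As a concrete illustration of the analysis step, the sketch below implements the simple peak-picking of amplitudes, frequencies, and phases from a short-time Fourier transform, together with a deliberately simplified per-frame resynthesis. The frame-to-frame track matching ("birth" and "death") and the cubic phase interpolation described above are omitted, and the window, FFT size, and peak count are arbitrary assumptions.

```python
import numpy as np

def pick_peaks(frame, fs, n_fft=1024, max_peaks=40):
    """Return (amplitude, frequency, phase) of local maxima in the short-time spectrum."""
    win = np.hanning(len(frame))
    spec = np.fft.rfft(frame * win, n_fft)
    mag = np.abs(spec)
    # Simple peak picking: bins larger than both neighbors, strongest first
    idx = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    idx = idx[np.argsort(mag[idx])[::-1]][:max_peaks]
    amps = 2 * mag[idx] / win.sum()        # undo the Hann window's gain
    freqs = idx * fs / n_fft               # bin index -> Hz
    return amps, freqs, np.angle(spec[idx])

def synthesize_frame(amps, freqs, phases, n, fs):
    """Sum of constant-parameter sine waves for one frame; the paper instead
    interpolates amplitude linearly and phase with a maximally smooth cubic."""
    t = np.arange(n) / fs
    return sum(a * np.cos(2 * np.pi * f * t + p)
               for a, f, p in zip(amps, freqs, phases))

# Toy usage: analyze and crudely resynthesize one frame of a two-tone signal
fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 440 * t) + 0.5 * np.cos(2 * np.pi * 1000 * t)
amps, freqs, phases = pick_peaks(x[:512], fs)
y = synthesize_frame(amps, freqs, phases, 512, fs)
```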
Speech production has long been viewed as a linear filtering process, as described by Fant in the late 1950's [10]. The vocal tract, which acts as the filter, is the primary focus of most speech work. This thesis develops a method for estimating the source of speech, the glottal flow derivative. Models are proposed for the coarse and fine structure of the glottal flow derivative, accounting for nonlinear source-filter interaction, and techniques are developed for estimating the parameters of these models. The importance of the source is demonstrated through speaker identification experiments.

The glottal flow derivative waveform is estimated from the speech signal by inverse filtering the speech with a vocal tract estimate obtained during the glottal closed phase. The closed phase is determined through a sliding covariance analysis with a very short time window and a one-sample shift. This allows calculation of the formant motion within each pitch period that Ananthapadmanabha and Fant predicted to result from nonlinear source-filter interaction during the glottal open phase [1]. By identifying the timing of this formant modulation from the formant tracks, the timing of the closed phase can be determined. The glottal flow derivative is modeled using an LF model to capture the coarse structure, while the fine structure is modeled through energy measures and a parabolic fit to the frequency modulation of the first formant.

The model parameters are used in the Reynolds Gaussian mixture model speaker identification system with excellent results for non-degraded speech. Each category of source features is shown to contain speaker-dependent information, while the combination of source and filter parameters increases the overall accuracy of the system. For a large dataset, the coarse structure parameters achieve 60% accuracy, the fine structure parameters give 40% accuracy, and their combination yields 70% correct identification. When combined with vocal tract features, the accuracy increases to 93%, slightly above the accuracy achieved with vocal tract information alone. On smaller datasets of telephone-degraded speech, accuracy increases by up to 20% when source features are added to traditional mel-cepstral measures.
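A minimal sketch of the closed-phase inverse-filtering step is given below: a covariance-method (unwindowed least-squares) linear-prediction fit over a very short analysis span, followed by inverse filtering with the resulting prediction-error filter to approximate the glottal flow derivative. The span location, filter order, and the use of random data as a stand-in signal are assumptions for illustration; locating the closed phase from formant modulation, and the LF-model fit, are not shown.

```python
import numpy as np

def covariance_lpc(x, order):
    """Covariance-method linear prediction: least-squares over the analysis
    interval with no windowing, suited to very short closed-phase spans."""
    n = len(x)
    # Data matrix of past samples (lags 1..order) and the target vector
    X = np.column_stack([x[order - k - 1:n - k - 1] for k in range(order)])
    y = x[order:n]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.concatenate(([1.0], -a))     # prediction-error filter A(z)

def inverse_filter(speech, a):
    """Pass speech through A(z) to approximate the glottal flow derivative."""
    return np.convolve(speech, a, mode='same')

# Toy usage: fit a 10th-order filter on a presumed closed-phase span
rng = np.random.default_rng(1)
speech = rng.normal(size=4000)             # stand-in for a voiced segment
closed_phase = speech[1000:1000 + 40]      # very short analysis window
a = covariance_lpc(closed_phase, order=10)
gfd = inverse_filter(speech, a)
```

The covariance (rather than autocorrelation) formulation matters here because the closed-phase window is only a few dozen samples long, too short to tolerate the tapering a windowed analysis would impose.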
Auditory attention decoding (AAD) through a brain-computer interface has had a flowering of developments since it was first introduced by Mesgarani and Chang (2012) using electrocorticograph recordings. AAD has been pursued for its potential application to hearing-aid design, in which an attention-guided algorithm selects, from multiple competing acoustic sources, which should be enhanced for the listener and which should be suppressed. Traditionally, researchers have separated the AAD problem into two stages: reconstruction of a representation of the attended audio from neural signals, followed by determining the similarity between the candidate audio streams and the reconstruction. Here, we compare the traditional two-stage approach with a novel neural-network architecture that subsumes the explicit similarity step. We compare this new architecture against linear and non-linear (neural-network) baselines using both wet and dry electroencephalogram (EEG) systems. Our results indicate that the new architecture outperforms the baseline linear stimulus-reconstruction method, improving decoding accuracy from 66% to 81% using wet EEG and from 59% to 87% for dry EEG. Also of note was the finding that the dry EEG system can deliver comparable or even better results than the wet, despite having only one-third as many EEG channels. The 11-subject, wet-electrode AAD dataset for two competing, co-located talkers, the 11-subject, dry-electrode AAD dataset, and our software are available for further validation, experimentation, and modification.
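The traditional two-stage baseline referred to above can be sketched compactly: a ridge-regularized linear decoder maps time-lagged EEG to a reconstruction of the attended-speech envelope, and the candidate stream whose envelope correlates best with the reconstruction is declared attended. The lag span and regularization strength below are illustrative assumptions, not the study's settings.

```python
import numpy as np

def lag_matrix(eeg, lags):
    """Stack time-lagged copies of each EEG channel: (T, C) -> (T, C * lags)."""
    T, C = eeg.shape
    X = np.zeros((T, C * lags))
    for l in range(lags):
        X[l:, l * C:(l + 1) * C] = eeg[:T - l]
    return X

def train_decoder(eeg, envelope, lags=32, alpha=1e2):
    """Ridge-regularized linear map from lagged EEG to the attended envelope."""
    X = lag_matrix(eeg, lags)
    XtX = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ envelope)

def decode_attention(eeg, env_a, env_b, w, lags=32):
    """Reconstruct an envelope from EEG, then pick the candidate stream that
    correlates best with the reconstruction (the explicit similarity stage)."""
    recon = lag_matrix(eeg, lags) @ w
    corr = [np.corrcoef(recon, e)[0, 1] for e in (env_a, env_b)]
    return int(np.argmax(corr))    # 0 -> talker A, 1 -> talker B
```

The end-to-end network described in the abstract replaces the final correlation step with learned layers that output the attention decision directly from the EEG and candidate audio.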
This paper develops a multiband or wavelet approach for capturing the AM-FM components of modulated signals immersed in noise. The technique utilizes the recently popularized nonlinear energy operator $\Psi(s) = (\dot{s})^2 - s\ddot{s}$ to isolate the AM-FM energy, and an energy separation algorithm (ESA) to extract the instantaneous amplitudes and frequencies. It is demonstrated that the performance of the energy operator/ESA approach is vastly improved if the signal is first filtered through a bank of bandpass filters and, at each instant, analyzed (via $\Psi$ and the ESA) using the dominant local channel response. Moreover, it is found that uniform (worst-case) performance across the frequency spectrum is attained by using a constant-Q, or multiscale wavelet-like, filter bank.

The elementary stochastic properties of $\Psi$ and of the ESA are developed first. The performance of $\Psi$ and the ESA when applied to bandpass-filtered versions of an AM-FM signal-plus-noise combination is then analyzed. The predicted performance is greatly improved by filtering if the local signal frequencies occur in-band. These observations motivate the multiband energy operator and ESA approach, ensuring the in-band analysis of local AM-FM energy. In particular, the multiple bands must have the constant-Q or wavelet scaling property to ensure uniform performance across bands. The theoretical predictions and the simulation results indicate that improved practical strategies are feasible for tracking and identifying AM-FM components in signals possessing pattern coherencies manifested as local concentrations of frequencies.
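In discrete time the operator takes the form $\Psi[x(n)] = x(n)^2 - x(n-1)\,x(n+1)$, and a standard energy separation algorithm built on it is DESA-2. The sketch below implements that single-band operator/ESA step; the constant-Q filter bank and dominant-channel selection that the multiband method adds are omitted, and the test signal is an arbitrary AM-FM chirp chosen for illustration.

```python
import numpy as np

def teager(x):
    """Discrete Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

def desa2(x, fs):
    """DESA-2 energy separation: instantaneous frequency (Hz) and amplitude."""
    y = np.zeros_like(x)
    y[1:-1] = x[2:] - x[:-2]                  # symmetric difference
    px, py = teager(x), teager(y)
    # cos(2*Omega) = 1 - Psi[y] / (2 * Psi[x]); clip to guard against noise
    arg = np.clip(1 - py / np.maximum(2 * px, 1e-12), -1.0, 1.0)
    omega = 0.5 * np.arccos(arg)              # radians per sample
    amp = 2 * px / np.sqrt(np.maximum(py, 1e-12))
    return omega * fs / (2 * np.pi), amp

# Toy usage: a chirp with slow amplitude modulation
fs = 8000
t = np.arange(fs) / fs
x = (1 + 0.3 * np.cos(2 * np.pi * 3 * t)) * np.cos(2 * np.pi * (500 * t + 200 * t ** 2))
freq, amp = desa2(x, fs)
```

In the multiband scheme, this same separation would be applied to each bandpass channel, keeping at each instant the estimate from the channel with the largest local Teager energy.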