Several different parametric representations of speech derived from the linear prediction model are examined for their effectiveness for automatic recognition of speakers from their voices. Twelve predictor coefficients were determined approximately once every 50 msec from speech sampled at 10 kHz. The predictor coefficients and other speech parameters derived from them, such as the impulse response function, the autocorrelation function, the area function, and the cepstrum function, were used as input to an automatic speaker-recognition system. The speech data comprised 60 utterances: six repetitions of the same sentence spoken by each of 10 speakers. The identification decision was based on the distance of the test sample vector from the reference vectors for the different speakers in the population; the speaker corresponding to the reference vector with the smallest distance was judged to be the unknown speaker. In verification, the speaker was verified if the distance between the test sample vector and the reference vector for the claimed speaker was less than a fixed threshold. Among all the parameters investigated, the cepstrum was found to be the most effective, providing an identification accuracy of 70% for speech 50 msec in duration, which increased to more than 98% for a duration of 0.5 sec. Using the same speech data, the verification accuracy was found to be approximately 83% for a duration of 50 msec, increasing to 98% for a duration of 1 sec. In a separate study to determine the feasibility of text-independent speaker identification, an identification accuracy of 93% was achieved for speech 2 sec in duration even though the texts of the test and reference samples were different.
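The two decision rules described above, nearest reference vector for identification and a fixed distance threshold for verification, can be sketched as follows. This is a minimal illustration only: the function names are invented, and plain Euclidean distance is assumed, whereas the paper's actual distance measure is not specified in the abstract.

```python
import numpy as np

def identify_speaker(test_vec, reference_vecs):
    """Identification: return the index of the reference vector
    (one per enrolled speaker) closest to the test parameter vector."""
    dists = [np.linalg.norm(test_vec - ref) for ref in reference_vecs]
    return int(np.argmin(dists))

def verify_speaker(test_vec, claimed_ref, threshold):
    """Verification: accept the identity claim only if the distance
    to the claimed speaker's reference vector is below a fixed threshold."""
    return np.linalg.norm(test_vec - claimed_ref) < threshold
```

In practice each reference vector would be an average (e.g. of cepstrum coefficients) over a speaker's training utterances, and the threshold would be tuned to trade off false acceptances against false rejections.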
We present numerical methods for studying the relationship between the shape of the vocal tract and its acoustic output. For a stationary vocal tract, the articulatory-acoustic relationship can be represented as a multidimensional function of a multidimensional argument: y=f(x), where x, y are vectors describing the vocal-tract shape and the resulting acoustic output, respectively. Assuming that y may be computed for any x, we develop a procedure for inverting f(x). Inversion by computer sorting consists of computing y for many values of x and sorting the resulting (y,x) pairs into a convenient order according to y; x for a given y is then obtained by looking up y in the sorted data. Application of this method for determining parameters of an articulatory model corresponding to a given set of formant frequencies is presented. A method is also described for finding articulatory regions (fibers) which map into a single point in the acoustic space. The local nature of f(x) is determined by linearization in a small neighborhood. Larger regions are explored by extending the linear neighborhoods in small steps. This method was applied for the study of compensatory articulation. Sounds produced by various articulations along a fiber were synthesized and were compared by informal listening tests. These tests show that, in many cases of interest, a given sound could be produced by many different vocal-tract shapes.
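The "inversion by computer sorting" idea above can be sketched as a table lookup: compute y = f(x) for many articulatory vectors x, store the (y, x) pairs, and recover an x for a given target y by finding the nearest stored y. The sketch below uses a brute-force nearest-neighbour search rather than an actual sorted index, and the mapping f is a toy stand-in, not a vocal-tract model.

```python
import numpy as np

def build_table(f, xs):
    """Tabulate (y, x) pairs by evaluating the forward map f
    at many candidate articulatory vectors x."""
    return [(np.asarray(f(x)), x) for x in xs]

def invert(table, y_target):
    """Approximate inverse: return the stored x whose computed
    acoustic output y lies nearest the target y."""
    dists = [np.linalg.norm(y - y_target) for y, _ in table]
    return table[int(np.argmin(dists))][1]
```

With a sorted or indexed table the lookup step becomes fast enough to repeat for many targets, which is the point of the sorting formulation; the brute-force search here keeps the sketch short.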
Abstract: In speech analysis, the voiced-unvoiced decision is usually performed in conjunction with pitch analysis. The linking of the voiced-unvoiced (V-UV) decision to pitch analysis not only results in unnecessary complexity, but makes it difficult to classify short speech segments which are less than a few pitch periods in duration. In this paper, we describe a pattern recognition approach for deciding whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal. In this method, five different measurements are made on the speech segment to be classified. The measured parameters are the zero-crossing rate, the speech energy, the correlation between adjacent speech samples, the first predictor coefficient from a 12-pole linear predictive coding (LPC) analysis, and the energy in the prediction error. The speech segment is assigned to a particular class based on a minimum-distance rule obtained under the assumption that the measured parameters are distributed according to the multidimensional Gaussian probability density function. The means and covariances for the Gaussian distribution are determined from manually classified speech data included in a training set. The method has been found to provide reliable classification with speech segments as short as 10 ms and has been used for both speech analysis-synthesis and recognition applications. A simple nonlinear smoothing algorithm is described to provide a smooth 3-level contour of an utterance for use in speech recognition applications. Quantitative results and several examples illustrating the performance of the method are included in the paper.
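The minimum-distance rule under a Gaussian assumption can be sketched as below: each class (voiced, unvoiced, silence) is summarized by a mean vector and covariance estimated from hand-labeled training frames, and a five-dimensional measurement vector is assigned to the class with the smallest Mahalanobis-style distance. This is a simplified sketch (the log-determinant term of the full Gaussian discriminant is omitted), not the paper's exact decision rule.

```python
import numpy as np

def classify_segment(x, means, inv_covs):
    """Assign measurement vector x (zero-crossing rate, energy,
    adjacent-sample correlation, first LPC coefficient, prediction-error
    energy) to the class i minimizing (x - m_i)^T C_i^{-1} (x - m_i)."""
    dists = [(x - m) @ ic @ (x - m) for m, ic in zip(means, inv_covs)]
    return int(np.argmin(dists))  # e.g. 0=silence, 1=unvoiced, 2=voiced
```

The means and inverse covariances would be estimated per class from a manually classified training set, mirroring the training procedure described in the abstract.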
Linear predictive coding (LPC) parameters are widely used in various speech processing applications for representing the spectral envelope information of speech. For low bit rate speech coding applications, it is important to quantize these parameters accurately using as few bits as possible without sacrificing the speech quality. Though vector quantizers are more efficient than scalar quantizers, their use for fine quantization of LPC information (using 24-26 bits/frame) is impeded by their prohibitively high complexity. In this paper, a split vector quantization approach is used to overcome the complexity problem. Here, the LPC vector consisting of 10 line spectral frequencies (LSFs) is divided into two parts and each part is quantized separately using vector quantization. Using the localized spectral sensitivity property of the LSF parameters, a weighted LSF distance measure is proposed. Using this distance measure, it is shown that the split vector quantizer can quantize LPC information in 24 bits/frame with 1 dB average spectral distortion and < 2% outlier frames (having spectral distortion greater than 2 dB).
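The split-VQ idea above can be sketched as follows: the 10-dimensional LSF vector is split into two 5-dimensional halves, and each half is quantized by exhaustive search over its own codebook under a weighted distance. The weights modeling the localized spectral sensitivity, the 5+5 split, and the tiny codebooks below are illustrative assumptions; real codebooks (e.g. 12 bits per half for 24 bits/frame) would be trained on speech data.

```python
import numpy as np

def weighted_lsf_distance(lsf, cand, w):
    """Weighted squared distance between LSF subvectors; the weights w
    stand in for the per-LSF spectral sensitivity weighting."""
    return float(np.sum(w * (lsf - cand) ** 2))

def split_vq(lsf, codebook_lo, codebook_hi, w):
    """Quantize the first 5 and last 5 LSFs independently, each by
    exhaustive nearest-codeword search over its own codebook."""
    lo, hi = lsf[:5], lsf[5:]
    i = min(range(len(codebook_lo)),
            key=lambda k: weighted_lsf_distance(lo, codebook_lo[k], w[:5]))
    j = min(range(len(codebook_hi)),
            key=lambda k: weighted_lsf_distance(hi, codebook_hi[k], w[5:]))
    return np.concatenate([codebook_lo[i], codebook_hi[j]]), (i, j)
```

Splitting reduces complexity because two searches over codebooks of size 2^12 replace one search over a single codebook of size 2^24, at the cost of ignoring dependence between the two halves.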