This talk presents phonetic models that capture both the dynamic characteristics and the statistical dependencies of acoustic attributes in a segment-based framework. The approach is based on the creation of a track, Tα, for each phonetic unit α. The track serves as a model of the dynamic trajectories of the acoustic attributes over the segment. The statistical framework for scoring incorporates the auto- and cross-correlation properties of the track error over time, within a segment. On a vowel classification task [W. Goldenthal and J. Glass, ‘‘Modeling Spectra Dynamics for Vowel Classification,’’ Proc. Eurospeech 93, pp. 289–292, Berlin, Germany (1993)], this methodology achieved classification performance of 68.9%. This result compares favorably with other studies using the TIMIT corpus. This talk extends that result by presenting context-independent and context-dependent experiments for all the phones. Context-independent classification performance of 76.8% is demonstrated. The key to implementing the context-dependent classifier is merging tracks trained separately on left and right contexts, which makes it possible to synthesize any desired context during classification, including tracks for triphone contexts not seen in the training set. Using a total of 4167 gender-dependent biphone tracks, 58 phonetic statistical models, and no phone grammar, a context-dependent classification performance of 80.5% was achieved. This result increases to 85.8% when a trigram phone grammar is added.
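The central objects in the abstract above are tracks: time-normalized mean trajectories of acoustic attributes for a phonetic unit, against which a hypothesized segment is scored via its error. The sketch below illustrates that idea in Python; the function names, the fixed track length, and the use of simple linear resampling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_track(tokens, track_len=10):
    """Average time-normalized trajectories of training tokens into a track.

    Each token is an (n_frames, d) array of acoustic attributes; every token
    is linearly resampled to `track_len` frames before averaging.
    (Illustrative sketch; the paper's exact normalization may differ.)
    """
    resampled = []
    for tok in tokens:
        t_old = np.linspace(0.0, 1.0, len(tok))
        t_new = np.linspace(0.0, 1.0, track_len)
        # resample each attribute dimension onto the common time axis
        cols = [np.interp(t_new, t_old, tok[:, j]) for j in range(tok.shape[1])]
        resampled.append(np.array(cols).T)
    return np.mean(resampled, axis=0)

def track_error(track, segment):
    """Residual between a segment and the track stretched to its duration.

    The auto- and cross-correlation statistics of this residual are what the
    abstract's scoring framework would model.
    """
    t_old = np.linspace(0.0, 1.0, len(track))
    t_new = np.linspace(0.0, 1.0, len(segment))
    cols = [np.interp(t_new, t_old, track[:, j]) for j in range(track.shape[1])]
    return segment - np.array(cols).T
```

For a segment whose trajectory matches the track exactly, the residual is zero; classification would then score the residual under a per-phone Gaussian error model.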
This paper describes an approach to speech segmentation. Unlike approaches based on spectral measurements, our algorithm iteratively clusters an LPC representation of blocks of the time waveform. The algorithm uses a generalized maximum likelihood criterion for deciding when two neighboring pieces of the signal should be joined. This paper describes the algorithm and shows that it yields superior results when compared to metrics based on spectral or cepstral measurements.

Background

Segmentation of speech into phonetic components significantly accelerates search in segment-based speech systems. Given a list of possible segment boundaries, the recognizer need only hypothesize a subset of all possible start and end times for any phonetic segment. This leads to a substantial reduction in computation, even with high insertion rates, because the segmental search is O(n²) in the number of boundaries. However, to realize this gain in performance, the segmentation must be done computationally efficiently and without much prior knowledge.

Three main techniques have been explored in the past for segmentation. The first applies a sequential likelihood ratio test directly to the time series [1,2]. Since the test is not symmetric, good performance is only obtained by testing the signal for changes both forward and backward in time. Andre-Obrecht obtained an insertion rate of 120% and a deletion rate of 2.8% on a test set of 1534 French phonemes. This work was later extended in [4] and [5], both of which eliminated the need for a backward component of the test. Unfortunately, this approach is very computationally intensive because it works directly on each individual signal measurement. Another approach to segmentation is to use a smooth derivative operator, such as the Canny edge detector [3], on the signal's spectral representation. However, Glass [6] found that a clustering technique provides superior performance.
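The generalized likelihood criterion mentioned above compares how well two neighboring blocks are explained by one model versus two. A common form of this test, sketched below under a multivariate Gaussian assumption, compares log-determinants of sample covariances; the function name and the regularization term are illustrative, and the paper's exact statistic (defined over LPC models) may differ.

```python
import numpy as np

def glr_distance(x, y):
    """Generalized likelihood ratio between two blocks of feature vectors.

    x, y: arrays of shape (n_x, d) and (n_y, d). Small values mean the two
    blocks are nearly as well explained by a single Gaussian as by separate
    ones, i.e. they are good candidates for joining.
    (Hypothetical sketch; not the paper's exact criterion.)
    """
    def logdet_cov(a):
        # small ridge keeps the determinant finite for near-singular samples
        cov = np.cov(a, rowvar=False) + 1e-6 * np.eye(a.shape[1])
        return np.linalg.slogdet(cov)[1]

    n_x, n_y = len(x), len(y)
    joint = np.vstack([x, y])
    # log-likelihood deficit of the joint model relative to separate models
    return (n_x + n_y) * logdet_cov(joint) - n_x * logdet_cov(x) - n_y * logdet_cov(y)
```

Two blocks drawn from the same distribution score near zero; a mean shift between the blocks inflates the joint covariance and hence the statistic, signaling a boundary.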
Glass used a clustering algorithm on the outputs of Seneff's auditory model. The reported insertion and deletion rates of 5% and 3.5%, respectively, are computed over the optimal alignment between a multi-scale dendrogram and the known phonetic transcription. This optimizing search significantly improves on the results that would otherwise be achieved if the clusters obtained at any single level of the dendrogram were used as the segmentation. In Section 3 we report on experiments we conducted using this technique that permit a direct comparison.

Cluster-Based Segmentation

We implemented a single clustering algorithm and experimented with several distance metrics, including those used in [6]. The clustering algorithm requires two things:

1. An appropriate representation of the speech signal for clustering.
2. A distance metric for computing the distance between two clusters.

The clustering algorithm works as follows. Let {y_i} be a sequence of n observation vectors. These could be the time-series measurements, spectral measurement vectors, or cepstral measurement vectors. Divide this sequence ...
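The agglomerative scheme described above can be sketched as follows: start with one cluster per observation and repeatedly merge the closest pair of adjacent clusters until the desired number of segments remains. This is a minimal illustration of the generic loop, with a simple Euclidean distance between cluster means standing in for the metrics the paper actually evaluates; the function names and stopping rule are assumptions for the sketch.

```python
import numpy as np

def cluster_segments(frames, distance, n_segments):
    """Agglomerative segmentation of a frame sequence.

    frames: (n, d) array of observation vectors in time order.
    distance: callable taking two (m, d) arrays and returning a scalar.
    Merges only *adjacent* clusters, so cluster boundaries remain valid
    segment boundaries. (Illustrative sketch, not the paper's code.)
    """
    clusters = [frames[i:i + 1] for i in range(len(frames))]
    while len(clusters) > n_segments:
        # distance between every pair of neighboring clusters
        dists = [distance(clusters[i], clusters[i + 1])
                 for i in range(len(clusters) - 1)]
        i = int(np.argmin(dists))
        # join the most similar adjacent pair
        clusters[i:i + 2] = [np.vstack([clusters[i], clusters[i + 1]])]
    return clusters

# Example: two well-separated bands of frame values collapse into two segments.
frames = np.array([[0.0], [0.1], [0.05], [5.0], [5.1], [5.05]])
mean_dist = lambda a, b: float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))
segments = cluster_segments(frames, mean_dist, n_segments=2)
```

Swapping `mean_dist` for a likelihood-based metric such as a GLR statistic recovers the generalized maximum likelihood joining criterion the paper uses.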