Frequency of vibration has not been widely used as a parameter for encoding speech-derived information on the skin. Where it has been used, the frequencies employed have not necessarily been compatible with the capabilities of the tactile channel, and no determination was made of the information transmitted by the frequency variable, as differentiated from other parameters used simultaneously, such as duration, amplitude, and location. However, several investigators have shown that difference limens for vibration frequency may be small enough to make stimulus frequency useful in encoding a speech-derived parameter such as the fundamental frequency of voiced speech. In the studies reported here, measurements have been made of the frequency discrimination ability of the volar forearm, using both sinusoidal and pulse waveforms. Stimulus configurations included the constant-frequency vibrations used by other laboratories as well as frequency-modulated (warbled) stimulus patterns. The frequency of a warbled stimulus was designed to have temporal variations analogous to those found in speech. The results suggest that it may be profitable to display the fundamental frequency of voiced speech on the skin as vibratory frequency, though it might be desirable to recode fundamental frequency into a frequency range more closely matched to the skin's capability.
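A warbled (frequency-modulated) stimulus of the kind described can be sketched as below; the carrier frequency, modulation rate, and frequency deviation chosen here are illustrative values, not the parameters used in the study:

```python
import numpy as np

def warble(fs=8000, dur=1.0, fc=25.0, fm=2.0, depth=10.0):
    """Frequency-modulated (warbled) sinusoid: carrier fc Hz,
    modulation rate fm Hz, peak frequency deviation `depth` Hz.
    Parameter values are illustrative, not those of the study."""
    t = np.arange(int(fs * dur)) / fs
    # Instantaneous frequency is fc + depth*sin(2*pi*fm*t);
    # the phase is the integral of the instantaneous frequency.
    phase = 2 * np.pi * (fc * t
                         - depth / (2 * np.pi * fm)
                         * (np.cos(2 * np.pi * fm * t) - 1))
    return np.sin(phase)
```

Integrating the instantaneous frequency (rather than multiplying it directly by time) keeps the phase continuous, so the warble sweeps smoothly between fc − depth and fc + depth at the modulation rate.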
The first three formants, i.e., the first three spectral prominences of the short-time magnitude spectra, have been the most commonly used acoustic cues for vowels ever since the work of Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)]. However, spectral shape features, which encode the global smoothed spectrum, provide a more complete spectral description, and therefore might be even better acoustic correlates for vowels. In this study automatic vowel classification experiments were used to compare formants and spectral-shape features for monophthongal vowels spoken in the context of isolated CVC words, under a variety of conditions. The roles of static and time-varying information for vowel discrimination were also compared. Spectral shape was encoded using the coefficients in a cosine expansion of the nonlinearly scaled magnitude spectrum. Under almost all conditions investigated, in the absence of fundamental frequency (F0) information, automatic vowel classification based on spectral-shape features was superior to that based on formants. If F0 was used as an additional feature, vowel classification based on spectral shape features was still superior to that based on formants, but the differences between the two feature sets were reduced. It was also found that the error pattern of perceptual confusions was more closely correlated with errors in automatic classification obtained from spectral-shape features than with classification errors from formants. Therefore it is concluded that spectral-shape features are a more complete set of acoustic correlates for vowel identity than are formants. In comparing static and time-varying features, static features were the most important for vowel discrimination, but feature trajectories were valuable secondary sources of information.
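The spectral-shape encoding described, a cosine expansion of a nonlinearly scaled magnitude spectrum, can be sketched roughly as follows; the mel warping, the band count, and the number of coefficients are illustrative assumptions, not the exact parameterization of the study:

```python
import numpy as np

def spectral_shape_features(frame, fs=16000, n_coef=10, n_bands=32):
    """Encode global spectral shape as the low-order cosine-expansion
    (DCT-II) coefficients of a nonlinearly (mel-like) warped log
    magnitude spectrum. Constants are illustrative assumptions."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    # Nonlinear (mel) frequency warping: resample the spectrum at
    # equal steps along the warped frequency axis.
    mel = 2595.0 * np.log10(1 + freqs / 700.0)
    mel_grid = np.linspace(mel[0], mel[-1], n_bands)
    warped = np.interp(mel_grid, mel, spec)
    log_spec = np.log(warped + 1e-10)
    # Cosine expansion of the warped log spectrum.
    n = np.arange(n_bands)
    basis = np.cos(np.pi * np.outer(np.arange(n_coef), n + 0.5) / n_bands)
    return basis @ log_spec
```

Because the low-order cosine coefficients capture only the smooth envelope, this representation describes the global spectrum rather than individual peak locations, which is the contrast with formants that the study exploits.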
In this paper, a fundamental frequency (F0) tracking algorithm is presented that is extremely robust for both high quality and telephone speech, at signal-to-noise ratios ranging from clean speech to very noisy speech. The algorithm is named "YAAPT," for "yet another algorithm for pitch tracking." The algorithm is based on a combination of time domain processing, using the normalized cross correlation, and frequency domain processing. Major steps include processing of the original acoustic signal and a nonlinearly processed version of the signal, the use of a new method for computing a modified autocorrelation function that incorporates information from multiple spectral harmonic peaks, peak picking to select multiple F0 candidates and associated figures of merit, and extensive use of dynamic programming to find the "best" track among the multiple F0 candidates. The algorithm was evaluated by using three databases and compared to three other published F0 tracking algorithms by using both high quality and telephone speech for various noise conditions. For clean speech, the error rates obtained are comparable to those obtained with the best results reported for any other algorithm; for noisy telephone speech, the error rates obtained are lower than those obtained with other methods.
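Two of the steps above, the normalized cross-correlation for generating F0 candidates and the dynamic-programming search for the best track, can be sketched as follows. This is a simplified illustration only: the lag range, merit handling, and transition penalty are assumptions, and the spectral-domain processing of YAAPT is omitted entirely.

```python
import numpy as np

def nccf(x, fs, f0_min=60.0, f0_max=400.0):
    """Normalized cross-correlation over the plausible F0 lag range.
    Returns (lags in seconds, correlation values). Illustrative sketch."""
    k_min, k_max = int(fs / f0_max), int(fs / f0_min)
    n = len(x) - k_max
    e0 = np.dot(x[:n], x[:n])
    vals = []
    for k in range(k_min, k_max + 1):
        seg = x[k:k + n]
        vals.append(np.dot(x[:n], seg)
                    / np.sqrt(e0 * np.dot(seg, seg) + 1e-12))
    return np.arange(k_min, k_max + 1) / fs, np.array(vals)

def best_track(frames, trans_w=0.5):
    """Dynamic programming over per-frame candidate lists of
    (f0, merit) pairs: minimize total (-merit) plus a log-frequency
    jump penalty between consecutive frames."""
    costs = [-m for _, m in frames[0]]
    back = []
    for t in range(1, len(frames)):
        new_costs, ptr = [], []
        for f0, merit in frames[t]:
            j = min(range(len(frames[t - 1])),
                    key=lambda i: costs[i]
                    + trans_w * abs(np.log(f0 / frames[t - 1][i][0])))
            ptr.append(j)
            new_costs.append(costs[j] - merit
                             + trans_w * abs(np.log(f0 / frames[t - 1][j][0])))
        back.append(ptr)
        costs = new_costs
    # Backtrace from the cheapest final candidate.
    j = int(np.argmin(costs))
    track = []
    for t in range(len(frames) - 1, -1, -1):
        track.append(frames[t][j][0])
        if t > 0:
            j = back[t - 1][j]
    return track[::-1]
```

The transition penalty on the log-frequency jump is what lets the tracker reject isolated octave errors: a spurious candidate at double the true F0 may have high local merit but incurs a large cost to reach and leave.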
A spoken language system combines speech recognition, natural language processing, and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language systems range from simple tasks, such as retrieving information from an existing database (traffic reports, airline schedules), to interactive problem solving tasks involving complex planning and reasoning (travel planning, traffic routing), to support for multilingual interactions. We examine eight key areas in which basic research is needed to produce spoken language systems: 1) robust speech recognition; 2) automatic training and adaptation; 3) spontaneous speech; 4) dialogue models; 5) natural language response generation; 6) speech synthesis and speech generation; 7) multilingual systems; and 8) interactive multimodal systems. In each area, we identify key research challenges, the infrastructure needed to support research, and the expected benefits. We conclude by reviewing the need for multidisciplinary research, for development of shared corpora and related resources, for computational support and for rapid communication among researchers. The successful development of this technology will increase accessibility of computers to a wide range of users, will facilitate multinational communication and trade, and will create new research specialties and jobs in this rapidly expanding area.
A comprehensive investigation of two acoustic feature sets for English stop consonants spoken in syllable initial position was conducted to determine the relative invariance of the features that cue place and voicing. The features evaluated were overall spectral shape, encoded as the cosine transform coefficients of the nonlinearly scaled amplitude spectrum, and formants. In addition, features were computed both for the static case, i.e., from one 25-ms frame starting at the burst, and for the dynamic case, i.e., as parameter trajectories over several frames of speech data. All features were evaluated with speaker-independent automatic classification experiments using the data from 15 speakers to train the classifier and the data from 15 different speakers for testing. The primary conclusions from these experiments, as measured via automatic recognition rates, are as follows: (1) spectral shape features are superior to both formants, and formants plus amplitudes; (2) features extracted from the dynamic spectrum are superior to features extracted from the static spectrum; and (3) features extracted from the speech signal beginning with the burst onset are superior to features extracted from the speech signal beginning with the vowel transition. Dynamic features extracted from the smoothed spectra over a 60-ms interval timed to begin with the burst onset appear to account for the primary vowel context effects. Automatic recognition results for the 6 stops (93.7%) based on 20 features were better than the rates obtained with human listeners for a 50-ms segment (89.9%) and only slightly worse than the rates obtained by human listeners for a 100-ms interval (96.6%). Thus the basic conclusion from our work is that dynamic spectral shape features are acoustically invariant cues for both place and voicing in initial stop consonants.
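The "dynamic" features above summarize how each spectral parameter evolves across the frames of the analysis interval. One simple way to encode such parameter trajectories compactly is a low-order cosine series over the frame index; this is only an illustrative sketch of trajectory encoding, not the study's exact parameterization:

```python
import numpy as np

def feature_trajectories(frame_feats, n_terms=3):
    """Encode each feature's time course over the analysis interval
    with low-order cosine-series coefficients, yielding compact
    'dynamic' features. frame_feats: array of shape
    (n_frames, n_features). Illustrative sketch only."""
    T, _ = frame_feats.shape
    t = (np.arange(T) + 0.5) / T
    # Rows: cosine basis functions of increasing order over time.
    basis = np.cos(np.pi * np.outer(np.arange(n_terms), t))  # (n_terms, T)
    return (basis @ frame_feats) / T  # (n_terms, n_features)
```

The zeroth-order coefficient is the time average (the "static" component), while the higher-order coefficients capture the direction and curvature of each parameter's trajectory, which is the additional information the dynamic case contributes.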
In this paper, we present a pitch detection algorithm that is extremely robust for both high quality and telephone speech. The kernel method for this algorithm is the NCCF (normalized cross correlation function) reported by David Talkin [1]. Major innovations include: processing of the original acoustic signal and a nonlinearly processed version of the signal to partially restore very weak F0 components; intelligent peak picking to select multiple F0 candidates and assign merit factors; and incorporation of highly robust pitch contours obtained from smoothed versions of low frequency portions of spectrograms. Dynamic programming is used to find the "best" pitch track among all the candidates, using both local and transition costs. We evaluated our algorithm using the Keele pitch extraction reference database as "ground truth" for both "high quality" and "telephone" speech. For both types of speech, the error rates obtained are lower than the lowest reported in the literature.
The clinical diagnosis of Alzheimer's disease and other dementias is very challenging, especially in the early stages. Our hypothesis is that any disease that affects particular brain regions involved in speech production and processing will also leave detectable fingerprints in the speech. Computerized analysis of speech signals and computational linguistics have progressed to the point where an automatic speech analysis system is a promising approach for a low-cost non-invasive diagnostic tool for early detection of Alzheimer's disease. We present empirical evidence that strong discrimination between subjects with a diagnosis of probable Alzheimer's versus matched normal controls can be achieved with a combination of acoustic features from speech, linguistic features extracted from an automatically determined transcription of the speech including punctuation, and results of a Mini-Mental State Examination (MMSE). We also show that discrimination is nearly as strong even if the MMSE is not used, which implies that a fully automated system is feasible. Since commercial automatic speech recognition (ASR) tools were unable to provide transcripts for about half of our speech samples, a customized ASR system was developed.