As part of our effort to develop a spoken language system for interactive problem solving, we recently collected a sizeable body of speech data. The database consists of spontaneous sentences collected during a simulated human/machine dialogue. Since a computer log of the spoken dialogue was maintained, we were also able to ask the subjects to provide read versions of the sentences. This paper documents the data collection process and provides some preliminary analyses of the collected data.
Extemporaneously generated speech often contains verbal hesitations, filled pauses, and unfilled pauses, reflecting the speaker's uncertainty in formulating sentences, as in "Where iiiiiis um the nearest bank." This study attempts to describe their acoustic properties using a subset of the spontaneous-speech VOYAGER urban navigation corpus, consisting of 3167 utterances from 66 speakers, based on time-aligned orthographic and phonetic transcriptions. The subset contains 564 verbal hesitations, 2518 unfilled pauses, and 148 filled pauses, concentrated in 49.6% of the corpus utterances. Of the unfilled pauses, 74.4% occur in isolation; their durations are 46.1% and 363.9% longer, respectively, when they co-occur with verbal hesitations and filled pauses. Over 70% of the verbal hesitations and filled pauses are followed by unfilled pauses, and these are longer than their isolated counterparts. The results thus suggest a mutually reinforcing effect among these acoustic events. An attempt has been made to identify verbal hesitations and filled pauses based on relative duration, proximity to silence, and relative mean F0, using regression tree analyses; a classification accuracy of approximately 70% on unseen data has been achieved.
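To make the classification idea above concrete, the following is a minimal illustrative sketch (not the paper's actual model): a hand-built decision tree over the three cues the abstract names, with hypothetical thresholds and synthetic feature values chosen only for illustration.

```python
# Illustrative sketch: classify a token as a hesitation/filled pause
# vs. fluent speech using the three cues named in the abstract --
# relative duration, proximity to silence, and relative mean F0.
# All thresholds and feature values here are hypothetical.

def classify(rel_duration, near_silence, rel_f0):
    """Tiny hand-built decision tree mimicking a regression-tree split.

    rel_duration: token duration relative to its phone-class mean
    near_silence: True if the token abuts an unfilled pause
    rel_f0:       mean F0 relative to the utterance mean
    """
    # Split 1: strongly lengthened tokens are candidate hesitations.
    if rel_duration > 1.5:
        # Split 2: hesitations tend to abut a pause and show an F0 drop.
        if near_silence or rel_f0 < 0.9:
            return "hesitation"
    return "fluent"

# Synthetic examples: (relative duration, adjacent to silence, relative F0)
tokens = [
    (2.8, True, 0.80),   # long, pre-pausal, low F0
    (1.0, False, 1.00),  # typical fluent token
    (1.9, False, 0.85),  # lengthened with an F0 drop
    (1.2, True, 1.05),   # short, so kept as fluent despite the pause
]
print([classify(*t) for t in tokens])
```

A learned regression tree would fit the split thresholds from labeled data rather than fixing them by hand, but the resulting decision function has this same nested-threshold shape.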
Because of acoustic similarities between some letters of the alphabet, automatic recognition of continuously spoken letters is a difficult task. The goal of this study is to determine and compare how well listeners and spectrogram readers can recognize continuously spoken letter strings from multiple speakers. The interest in spectrogram reading results is motivated by the belief that this procedure may help to identify acoustic attributes and decision strategies that are useful for system implementation. Listening and spectrogram reading tests involving eight listeners and six spectrogram readers, respectively, were conducted using a corpus of 1000 wordlike strings designed to minimize the use of lexical knowledge. Results show that listeners' performance was better than readers' (98.4% vs. 91.0%). In both experiments, string lengths were determined very accurately (98.1% and 96.2%), presumably due to the large number of glottal stops inserted at letter boundaries to facilitate segmentation. Most errors were substitutions of one letter for another (68% and 92%, respectively), falling generally into two categories: asymmetric errors can often be attributed to subjects' disregard for contextual influence, whereas symmetric errors are largely due to acoustic similarities between certain letter pairs. Subsequent acoustic study of four of the most confusable letter pairs has resulted in the identification of a number of distinguishing acoustic attributes. Using these attributes, overall recognition performance exceeding that of the spectrogram readers was achieved. [Work supported by NSF and DARPA under contract N00014-82-K-0727, monitored through the Office of Naval Research.]