A three-tone sinusoidal replica of a naturally produced utterance was identified by listeners, despite the readily apparent unnatural speech quality of the signal. The time-varying properties of these highly artificial acoustic signals are apparently sufficient to support perception of the linguistic message in the absence of traditional acoustic cues for phonetic segments.
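The three-tone replica described above can be sketched in a few lines: each tone follows one formant's frequency and amplitude track, and the tones are summed. This is a minimal illustration, assuming the formant tracks are supplied externally (e.g., from a formant tracker); the function name and sampling rate are illustrative, not from the original study.

```python
import numpy as np

def sinewave_replica(formant_tracks, amp_tracks, fs=10000):
    """Sum of tones, each tracking one formant's frequency (Hz)
    and amplitude sample by sample."""
    out = np.zeros(len(formant_tracks[0]))
    for freqs, amps in zip(formant_tracks, amp_tracks):
        freqs = np.asarray(freqs, dtype=float)
        # integrate instantaneous frequency to obtain the tone's phase
        phase = 2.0 * np.pi * np.cumsum(freqs) / fs
        out += np.asarray(amps, dtype=float) * np.sin(phase)
    return out
```

With three tracks (one per formant), the result is a three-tone analogue of the utterance: no harmonics, no fundamental frequency, only the time-varying pattern of the formant centers.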
Accounts of the identification of words and talkers commonly rely on different acoustic properties. To identify a word, a perceiver discards acoustic aspects of an utterance that are talker specific, forming an abstract representation of the linguistic message with which to probe a mental lexicon. To identify a talker, a perceiver discards acoustic aspects of an utterance specific to particular phonemes, creating a representation of voice quality with which to search for familiar talkers in long-term memory. In 3 experiments, sinewave replicas of natural speech sampled from 10 talkers eliminated natural voice quality while preserving idiosyncratic phonetic variation. Listeners identified the sinewave talkers without recourse to acoustic attributes of natural voice quality. This finding supports a revised description of speech perception in which the phonetic properties of utterances serve to identify both words and talkers.
How does a perceiver resolve the linguistic properties of an utterance? This question has motivated many investigations within the study of speech perception and a great variety of explanations. In a retrospective summary 15 years ago, Klatt (1989) reviewed a large sample of theoretical descriptions of the perceiver's ability to project the sensory effects of speech, exhibiting inexhaustible variety, into a finite and small number of linguistically defined attributes, whether features, phones, phonemes, syllables, or words. Although he noted many distinctions among the accounts, with few exceptions they exhibited a common feature. Each presumed that perception begins with a speech signal, well-composed and fit to analyze. This common premise shared by otherwise divergent explanations of perception obliges the models to admit severe and unintended constraints on their applicability. To exist within the limits set by this simplifying assumption, the models are restricted to a domain in which speech is the only sound; moreover, only a single talker ever speaks at once. Although this designation is easily met in laboratory samples, it is safe to say that it is rare in vivo. Moreover, in their exclusive devotion to the perception of speech the models are tacitly modular (Fodor, 1983), whether or not they acknowledge it.

Despite the consequences of this dedication of perceptual models to speech and speech alone, there has been a plausible and convenient way to persist in invoking the simplifying assumption. This fundamental premise survives intact if a preliminary process of perceptual organization finds a speech signal, follows its patterned variation amid the effects of other sound sources, and delivers it whole and ready to analyze for linguistic properties. The indifference to the conditions imposed by the common perspective reflects an apparent consensus that perceptual organization of speech is simple, automatic, and accomplished by generic means.
However, despite the rapidly established perceptual coherence of the constituents of a speech signal, the perceptual organization of speech cannot be reduced to the available and well-established principles of auditory perceptual organization.
In two experiments, subjects monitored sequences of spoken consonant-vowel-consonant words and nonwords for a specified initial phoneme. In Experiment I, the target-carrying monosyllables were embedded in sequences in which the monosyllables were all words or all nonwords. The possible contextual bias of Experiment I was minimized in Experiment II through a random mixing of target-carrying words and nonwords with foil words and nonwords. Target-carrying words were distinguished in both experiments from target-carrying nonwords only in the final consonant, e.g., /bit/ vs. /bip/. In both experiments, subjects detected the specified consonant /b/ significantly faster when it began a word than when it began a nonword. One interpretation of this result is that in speech perception lexical information is accessed before phonological information. This interpretation was questioned and preference was given to the view that the result reflected processes subsequent to perception: words become available to awareness faster than nonwords and therefore provide a basis for differential responding that much sooner.

It is commonplace to conceptualize the process of pattern identification as a hierarchically organized sequence of operations that maps the structured energy at the receptors onto increasingly more abstract representations. In its most simplistic form, this conception characterizes the "conversation" between representations as unidirectional; that is, a more abstract representation is constructed with reference to a less abstract representation, but not vice versa. There are, however, a number of curious results that question the integrity of this characterization. By way of example, a briefly exposed and masked letter is recognized more accurately when part of a word than when part of a nonword (Wheeler, 1970; Reicher, Note 1). Other, related results suggest that this is a fairly general phenomenon.
Thus, detection of an oriented line is significantly better when it is part of a briefly exposed, and masked, unitary picture of a well-formed three-dimensional object than when it is a part of a picture portraying a less well-formed, and flat, arrangement of lines (Weisstein & Harris, 1974). As revealed in the work of Biederman and his colleagues
Our studies revealed two stable modes of perceptual organization, one based on attributes of auditory sensory elements and another based on attributes of patterned sensory variation composed by the aggregation of sensory elements. In a dual-task method, listeners attended concurrently to both aspects, component and pattern, of a sine wave analogue of a word. Organization of elements was indexed by several single-mode tests of auditory form perception to verify the perceptual segregation of either an individual formant of a synthetic word or a tonal component of a sinusoidal word analogue. Organization of patterned variation was indexed by a test of lexical identification. The results show the independence of the perception of auditory and phonetic form, which appear to be differently organized concurrent effects of the same acoustic cause.
In this study, downward-directed mechanical perturbations were applied to the lower lip during both repetitive (/…pæpæpæ…/) and discrete (/pəˈsæpæpl/) utterances in order to examine the perturbation-induced changes of intergestural timing between syllables (i.e., between the bilabial and laryngeal gestures for successive /p/'s) and within phonemes (i.e., between the bilabial and laryngeal gestures within single /p/'s). Our findings led us to several conclusions. First, steady-state (phase-resetting) analyses of the repetitive utterances indicated both that "permanent" phase shifts existed for both the lips and the larynx after the system returned to its pre-perturbation rhythm and that smaller steady-state shifts occurred in the relative phasing of these gestures. These results support the hypothesis that central intergestural dynamics can be reset by peripheral articulatory events. Such resetting was strongest when the perturbation was delivered within a "sensitive phase" of the cycle, during which the downwardly directed lower-lip perturbation opposed the just-initiated, actively controlled bilabial closing gesture for /p/. Although changes in syllable duration were found for other perturbed phases, these changes were simply transient effects and did not indicate a resetting of the central "clock." Second, analyses of the transient portions of the perturbed cycles of the repetitive utterances indicated that the perturbation-induced steady-state phase shifts are almost totally attributable to changes occurring during the first two perturbed cycles. Finally, the transient changes in speech timing induced by perturbations in the discrete sequences appeared to share a common dynamical basis with the changes to the repetitive sequences. We conclude by speculating on the type of dynamical system that could generate these temporal patterns.
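The steady-state (phase-resetting) analysis can be made concrete with a small numerical sketch: fit a period to the pre-perturbation event times (e.g., successive bilabial closures), extrapolate past the perturbation, and measure the residual offset of the recovered rhythm in fractions of a cycle. The function name, the event-time representation, and the least-squares period fit are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def permanent_phase_shift(pre_times, post_times, skipped_cycles):
    """Estimate the steady-state phase shift (fractions of a cycle)
    of a rhythm after a transient perturbation.

    pre_times:      event times (s) before the perturbation
    post_times:     event times (s) after the rhythm has recovered
    skipped_cycles: cycles elapsed between the two measurement windows
    """
    idx = np.arange(len(pre_times))
    # least-squares fit of period and onset to the unperturbed rhythm
    period, onset = np.polyfit(idx, pre_times, 1)
    # where the events would have fallen had no resetting occurred
    post_idx = np.arange(len(post_times)) + len(pre_times) + skipped_cycles
    predicted = onset + period * post_idx
    return float(np.mean(np.asarray(post_times) - predicted) / period)
```

A nonzero value here corresponds to a "permanent" phase shift: the rhythm resumes its old period but at a lastingly displaced phase, which is the signature of resetting a central clock rather than a merely transient disturbance.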
The personal attributes of a talker perceived via acoustic properties of speech are commonly considered to be an extralinguistic message of an utterance. Accordingly, accounts of the perception of talker attributes have emphasized a causal role of aspects of the fundamental frequency and coarse-grain acoustic spectra distinct from the detailed acoustic correlates of phonemes. In testing this view, in four experiments, we estimated the ability of listeners to ascertain the sex or the identity of 5 male and 5 female talkers from sinusoidal replicas of natural utterances, which lack fundamental frequency and natural vocal spectra. Given such radically reduced signals, listeners appeared to identify a talker's sex according to the central spectral tendencies of the sinusoidal constituents. Under acoustic conditions that prevented listeners from determining the sex of a talker, individual identification from sinewave signals was often successful. These results reveal that the perception of a talker's sex and identity are not contingent and that fine-grain aspects of a talker's phonetic production can elicit individual identification under conditions that block the perception of voice quality.

What can a listener perceive in the speech of an unfamiliar talker? Even a brief utterance can convey a linguistic message and something about the talker who produced it. Although the perception of personal attributes has commonly been explained by an account separate from the perception of linguistic properties, a recent study has shown that phonetic details can also be used to identify talkers and to distinguish them from one another (Remez, Fellowes, & Rubin, 1997). Surprisingly, when acoustic test materials forced performance to depend on phonetic attributes, listeners occasionally mistook male talkers for female talkers, and vice versa.
The present report describes a series of experiments intended to clarify the interpretation of this counterintuitive finding, posing these questions: (1) Is the sex of a talker identifiable in a sine wave utterance replica? (2) Are differences across talkers in the central spectral tendency of the sinusoidal constituents responsible for differing impressions of the sex of a sine wave talker? (3) Are individuals identifiable under acoustic conditions that preclude the identification of sex?

Many studies of talker recognition by ear, by automatic classification, or by visual inspection of spectrograms have sought to tie variation across individu...

This research was supported by Grants DC00308 (to R.E.R.) and HD01994 (to Haskins Laboratories) from the National Institutes of Health. The authors gratefully acknowledge the meticulous assistance and trenchant advice of Chris Darwin, Steve Goldinger, Harry Levitt, Jennifer Lipton, Larry Rosenblum, Jim Sawusch, Dalia Shoretz, Saskia Smith, Steve Stroessner, Doug Whalen, and Fay Xing. Correspondence should be addressed to R. E. Remez, Department of Psychology, Barnard College, 3009 Broadway, New York, NY 10027-6598 (e-mail: remez@paradise.barnard.columbia.edu).
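The "central spectral tendency" at issue in question (2) can be illustrated with a small sketch: average the frequencies of the sinusoidal constituents over time and compare the result to a cutoff. The cutoff value and the binary decision rule below are illustrative assumptions for exposition, not estimates from the study.

```python
import numpy as np

def central_spectral_tendency(tone_tracks):
    """Time-averaged mean frequency (Hz) across the sinusoidal components."""
    return float(np.mean([np.mean(track) for track in tone_tracks]))

def judge_sex(tone_tracks, cutoff_hz=1600.0):
    # hypothetical rule: higher overall tone frequencies -> "female"
    return "female" if central_spectral_tendency(tone_tracks) > cutoff_hz else "male"
```

A listener relying on such a summary statistic would succeed whenever a talker's tone frequencies sit clearly above or below the cutoff, and fail for talkers near it, which is consistent with the occasional sex misidentifications reported above.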