Despite the lack of invariance problem (the many-to-many mapping between acoustics and percepts), human listeners experience phonetic constancy and typically perceive what a speaker intends. Most models of human speech recognition (HSR) have sidestepped this problem, working with abstract, idealized inputs and deferring the challenge of working with real speech. In contrast, carefully engineered deep learning networks allow robust, real-world automatic speech recognition (ASR). However, the complexities of deep learning architectures and training regimens make it difficult to use them to gain direct insights into the mechanisms that may support HSR. In this brief article, we report preliminary results from a two-layer network that borrows one element from ASR, long short-term memory (LSTM) nodes, which provide dynamic memory over a range of temporal spans. This allows the model to learn to map real speech from multiple talkers to semantic targets with high accuracy, and with a human-like timecourse of lexical access and phonological competition. Internal representations emerge that resemble phonetically organized responses in human superior temporal gyrus, suggesting that the model develops a distributed phonological code despite receiving no explicit training on phonetic or phonemic targets. The ability to work with real speech is a major advance for cognitive models of HSR.
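To make this kind of architecture concrete, the sketch below is our own illustration, not the authors' implementation: the layer sizes, input features, and semantic dimensionality are all assumptions. It shows how a small LSTM network in PyTorch could map spectral frames of real speech to a distributed semantic target, emitting a prediction at every frame so that the timecourse of lexical activation can be tracked as the input unfolds.

```python
import torch
import torch.nn as nn

class SpeechToSemantics(nn.Module):
    """Minimal sketch: LSTM over acoustic frames -> distributed semantic output.

    All sizes are illustrative assumptions, not the published model's values.
    """
    def __init__(self, n_features=256, n_hidden=512, n_semantic=300):
        super().__init__()
        # LSTM nodes provide dynamic memory over a range of temporal spans
        self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True)
        # Linear readout to a distributed semantic target (e.g., an embedding)
        self.readout = nn.Linear(n_hidden, n_semantic)

    def forward(self, frames):
        # frames: (batch, time, n_features) spectral slices of real speech
        hidden, _ = self.lstm(frames)
        # Emit a semantic prediction at every frame so the timecourse of
        # lexical access and competition can be read off as input unfolds
        return self.readout(hidden)

# Example: 8 utterances of 200 frames each
model = SpeechToSemantics()
outputs = model(torch.randn(8, 200, 256))
print(outputs.shape)  # torch.Size([8, 200, 300])
```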
Efficient speech perception requires listeners to maintain an exquisite tension between stability of the language architecture and flexibility to accommodate variation in the input, such as that associated with individual talker differences in speech production. Achieving this tension can be guided by top-down learning mechanisms, wherein lexical information constrains the interpretation of speech input, and by bottom-up learning mechanisms, in which distributional information in the speech signal is used to optimize the mapping to speech sound categories. An open question for theories of perceptual learning concerns the nature of the representations that are built for individual talkers: do these representations reflect long-term, global exposure to a talker, or only short-term, local exposure? Recent research suggests that when lexical knowledge is used to resolve a talker's ambiguous productions, listeners disregard previous experience with the talker and instead rely only on recent experience, a finding that is contrary to the predictions of Bayesian belief-updating accounts of perceptual adaptation. Here we provide an additional test of global versus local exposure accounts using a distributional learning paradigm in which lexical information is not explicitly required to resolve ambiguous input. Listeners completed two blocks of phonetic categorization for stimuli that differed in voice onset time, a probabilistic cue to the voicing contrast in English stop consonants. In each block, two distributions were presented, one specifying /g/ and one specifying /k/. Across the two blocks, the variance of the distributions was manipulated to be either narrow or wide. The critical manipulation was the order of the two blocks: half of the listeners were first exposed to the narrow distributions followed by the wide distributions, with the order reversed for the other half. The results showed that for earlier trials, the identification slope was steeper for the narrow-wide group than for the wide-narrow group, but this difference was attenuated for later trials. The between-group convergence was driven by an asymmetry in learning between the two orders, such that only listeners in the narrow-wide group showed slope movement during exposure, a pattern that was mirrored by computational simulations in which the distributional statistics of the present talker were integrated with prior experience with English. This pattern of results suggests that listeners did not disregard all prior experience with the talker and instead used cumulative exposure to guide phonetic decisions, raising the possibility that accommodating a talker's phonetic signature entails maintaining representations that reflect global experience.
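The global-versus-local contrast can be made concrete with a toy belief-updating model. The sketch below is our illustration, not the simulations reported above: all numeric values (category means, prior variance, token counts) are assumptions. A listener's belief about the talker's within-category VOT variance is updated conjugately with each block of tokens, and the identification slope is read off the believed variance, so an early narrow block continues to constrain the slope even after a later wide block.

```python
import numpy as np

# Toy model of cumulative distributional learning; all values are
# illustrative assumptions, not the reported simulation's parameters.
MU_G, MU_K = 20.0, 70.0                 # assumed /g/ and /k/ mean VOTs (ms)
alpha, beta = 5.0, (5.0 - 1.0) * 150.0  # inverse-gamma prior: E[var] = 150 ms^2

def update_variance(alpha, beta, tokens, mu):
    """Conjugate inverse-gamma update of the within-category variance,
    treating the category mean as known."""
    return alpha + len(tokens) / 2.0, beta + np.sum((tokens - mu) ** 2) / 2.0

def id_slope(alpha, beta):
    """Slope of the log-odds of /k/ over /g/ at the category boundary:
    (MU_K - MU_G) / E[variance]; lower believed variance -> steeper slope."""
    return (MU_K - MU_G) / (beta / (alpha - 1.0))

rng = np.random.default_rng(1)

# Narrow-first order: low-variance tokens steepen the identification slope...
for mu, sd in ((MU_G, 5.0), (MU_K, 5.0)):
    alpha, beta = update_variance(alpha, beta, rng.normal(mu, sd, 40), mu)
print(f"slope after narrow block: {id_slope(alpha, beta):.3f}")

# ...and because evidence accumulates, the slope after a later wide block
# stays steeper than it would be if the narrow history were discarded
# (global rather than purely local exposure).
for mu, sd in ((MU_G, 20.0), (MU_K, 20.0)):
    alpha, beta = update_variance(alpha, beta, rng.normal(mu, sd, 40), mu)
print(f"slope after wide block:   {id_slope(alpha, beta):.3f}")
```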
Purpose: Speech perception is facilitated by listeners' ability to dynamically modify the mapping to speech sounds given systematic variation in speech input. For example, the degree to which listeners show categorical perception of speech input changes as a function of distributional variability in the input, with perception becoming less categorical as the input becomes more variable. Here, we test the hypothesis that higher level receptive language ability is linked to the ability to adapt to low-level distributional cues in speech input. Method: Listeners (n = 58) completed a distributional learning task consisting of 2 blocks of phonetic categorization for words beginning with /g/ and /k/. In 1 block, the distributions of voice onset time values specifying /g/ and /k/ had narrow variances (i.e., minimal variability). In the other block, the distributions of voice onset times specifying /g/ and /k/ had wider variances (i.e., increased variability). In addition, all listeners completed an assessment battery for receptive language, nonverbal intelligence, and reading fluency. Results: As predicted by an ideal observer computational framework, the participants in aggregate showed identification responses that were more categorical for consistent compared to inconsistent input, indicative of distributional learning. However, the magnitude of learning across participants showed wide individual variability, which was predicted by receptive language ability but not by nonverbal intelligence or reading fluency. Conclusion: The results suggest that individual differences in distributional learning for speech are linked, at least in part, to receptive language ability, reflecting a decreased ability among those with weaker receptive language to capitalize on consistent input distributions.
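Under an ideal observer framework, the categoricity of identification follows directly from the believed category variances. The following is a minimal sketch of that prediction (our illustration; the means and standard deviations are assumptions, not the stimulus values used in the task):

```python
import numpy as np

def p_k(vot, mu_g=20.0, mu_k=70.0, sd=5.0):
    """Ideal observer: posterior probability of /k/ given a VOT value (ms),
    assuming equal-prior Gaussian categories with shared variance."""
    like_g = np.exp(-0.5 * ((vot - mu_g) / sd) ** 2)
    like_k = np.exp(-0.5 * ((vot - mu_k) / sd) ** 2)
    return like_k / (like_g + like_k)

vots = np.linspace(30.0, 60.0, 4)  # probe VOTs spanning the boundary
print("narrow (sd = 5): ", np.round(p_k(vots, sd=5.0), 3))
print("wide   (sd = 20):", np.round(p_k(vots, sd=20.0), 3))
# Wider within-category variance yields a shallower, less categorical
# identification function over the same VOT values.
```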
A fundamental goal of research in the domain of speech perception has been to describe how listeners resolve the lack-of-invariance problem in order to achieve stable word recognition. Here we review work from our laboratory and others that has examined the representational nature of prelexical and lexical knowledge by considering the degree to which listeners customize the mapping from the acoustic signal to meaning on a talker-specific basis. One central finding is that while talker-specificity effects in speech perception are observed frequently, they are not absolute, and seem to be influenced by rich interactions within the cognitive and language architectures. We consider these findings with respect to their implications for abstract and episodic accounts of spoken word recognition.
Listeners show heightened talker recognition for native compared to nonnative speech, formalized as the language familiarity effect (LFE) for voice recognition. Some findings suggest that language comprehension is the locus of the LFE, while others implicate expertise with the linguistic sound structure. These hypotheses yield different predictions for the LFE with time-reversed speech, a manipulation that precludes lexical access but preserves some indexical and phonetic properties. Research to date shows discrepant results for the LFE with this impoverished signal. Here we reconcile this discrepancy by examining how the amount of exposure to talkers’ voices influences the LFE for time-reversed speech. Three experiments were conducted; in each, two groups of English monolinguals were trained and then tested on the identification of four English talkers and four French talkers, with one group hearing natural speech and the other hearing time-reversed speech. Across the experiments, we manipulated exposure to the voices in terms of the number of training trials and the duration of the talkers’ sentences. A robust LFE emerged in all cases, though its magnitude was attenuated as the amount of exposure decreased. These results are consistent with the account that the LFE for talker identification is linked to the sound structure of language.
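The time-reversal manipulation itself is straightforward to implement. As a minimal sketch (the file names are placeholders), reversing the sample order of a recorded sentence preserves its long-term spectral and voice-quality properties while destroying the temporal structure needed for lexical access:

```python
import soundfile as sf

# Reverse a sentence waveform sample-by-sample along the time axis.
audio, sample_rate = sf.read("talker_sentence.wav")
sf.write("talker_sentence_reversed.wav", audio[::-1], sample_rate)
```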