Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's (1982) terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In this article, we provide a computational-level analysis of the task of speech recognition, which reveals the close parallels between research concerned with HSR and ASR. We illustrate this relation by presenting a new computational model of human spoken-word recognition, built using techniques from the field of ASR that, in contrast to current existing models of HSR, recognizes words from real speech input.
This article investigates 2 questions: (1) does the presence of background noise lead to a differential increase in the number of simultaneously activated candidate words in native and nonnative listening? And (2) do individual differences in listeners' cognitive and linguistic abilities explain the differential effect of background noise on (non-)native speech recognition? English and Dutch students participated in an English word recognition experiment, in which either a word's onset or offset was masked by noise. The native listeners outperformed the nonnative listeners in all listening conditions. Importantly, however, the effect of noise on the multiple activation process was found to be remarkably similar in native and nonnative listening. The presence of noise increased the set of candidate words considered for recognition in both native and nonnative listening. The results indicate that the observed performance differences between the English and Dutch listeners should not be primarily attributed to a differential effect of noise, but rather to the difference between native and nonnative listening. Additional analyses showed that word-initial information was found to be more important than word-final information during spoken-word recognition. When word-initial information was no longer reliably available word recognition accuracy dropped and word frequency information could no longer be used suggesting that word frequency information is strongly tied to the onset of words and the earliest moments of lexical access. Proficiency and inhibition ability were found to influence nonnative spoken-word recognition in noise, with a higher proficiency in the nonnative language and worse inhibition ability leading to improved recognition performance. (PsycINFO Database Record
All words of the languages we know are stored in the mental lexicon. Psycholinguistic models describe in which format lexical knowledge is stored and how it is accessed when needed for language use. The present article summarizes key findings in spoken‐word recognition by humans and describes how models of spoken‐word recognition account for them. Although current models of spoken‐word recognition differ considerably in the details of implementation, there is general consensus among them on at least three aspects: multiple word candidates are activated in parallel as a word is being heard, activation of word candidates varies with the degree of match between the speech signal and stored lexical representations, and activated candidate words compete for recognition. No consensus has been reached on other aspects such as the flow of information between different processing levels, and the format of stored prelexical and lexical representations. WIREs Cogn Sci 2012, 3:387–401. doi: 10.1002/wcs.1178This article is categorized under: Linguistics > Computational Models of Language
Numerous studies have shown that younger adults engage in lexically guided perceptual learning in speech perception. Here, we investigated whether older listeners are also able to retune their phonetic category boundaries. More specifically, in this research we tried to answer two questions. First, do older adults show perceptual-learning effects of similar size to those of younger adults? Second, do differences in lexical behavior predict the strength of the perceptual-learning effect? An age group comparison revealed that older listeners do engage in lexically guided perceptual learning, but there were two agerelated differences: Younger listeners had a stronger learning effect right after exposure than did older listeners, but the effect was more stable for older than for younger listeners. Moreover, a clear link was shown to exist between individuals' lexical-decision performance during exposure and the magnitude of their perceptual-learning effects. A subsequent analysis on the results of the older participants revealed that, even within the older participant group, with increasing age the perceptual retuning effect became smaller but also more stable, mirroring the age group comparison results. These results could not be explained by differences in hearing loss. The age effect may be accounted for by decreased flexibility in the adjustment of phoneme categories or by age-related changes in the dynamics of spokenword recognition, with older adults being more affected by competition from similar-sounding lexical competitors, resulting in less lexical guidance for perceptual retuning.In conclusion, our results clearly show that the speech perception system remains flexible over the life span.Keywords Perceptual learning . Speech perception . Aging . Individual differencesNumerous studies have shown that "ideal" listeners-that is, young, normal-hearing, highly educated listeners-can adapt to idiosyncratic pronunciations through lexically guided perceptual learning in speech perception (McQueen, Cutler, & Norris, 2006;Norris, McQueen, & Cutler, 2003; for an overview, see Samuel & Kraljic, 2009), and are thus able to tune in to a speaker to understand him or her better. The lexically guided perceptual learning effect has been shown using a variety of exposure and test paradigms-for instance, lexical decision and phonetic categorization (e.g., Norris et al., 2003), short story presentation and phonetic categorization (e.g., Eisner & McQueen, 2006), and a picture verification procedure (e.g., McQueen, Tyler, & Cutler, 2012). In the exposure phase, listeners are exposed to an idiosyncratic sound-for instance, a sound ambiguous between [s] and [f] (/ f / s /), which would be learned as /s/ if it was heard in words such as platypus (because platypus is an existing word in English, whereas platypuf is not), but as /f/ in words such as giraffe (which is an existing word in English, whereas giras is not). This perceptual-learning effect is caused by a temporary change in phonetic category representations, rat...
Recent evidence shows that listeners use abstract prelexical units in speech perception. Using the phenomenon of lexical retuning in speech processing, we ask whether those units are necessarily phonemic. Dutch listeners were exposed to a Dutch speaker producing ambiguous phones between the Dutch syllable-final allophones approximant [r] and dark [l]. These ambiguous phones replaced either final /r/ or final /l/ in words in a lexical-decision task. This differential exposure affected perception of ambiguous stimuli on the same allophone continuum in a subsequent phonetic-categorization test: Listeners exposed to ambiguous phones in /r/-final words were more likely to perceive test stimuli as /r/ than listeners with exposure in /l/-final words. This effect was not found for test stimuli on continua using other allophones of /r/ and /l/. These results confirm that listeners use phonological abstraction in speech perception. They also show that context-sensitive allophones can play a role in this process, and hence that context-insensitive phonemes are not necessary. We suggest there may be no one unit of perception.
This study investigates two variables that may modify lexically guided perceptual learning: individual hearing sensitivity and attentional abilities. Older Dutch listeners (aged 60+ years, varying from good hearing to mild-tomoderate high-frequency hearing loss) were tested on a lexically guided perceptual learning task using the contrast [f]- [s]. This contrast mainly differentiates between the two consonants in the higher frequencies, and thus is supposedly challenging for listeners with hearing loss. The analyses showed that older listeners generally engage in lexically guided perceptual learning. Hearing loss and selective attention did not modify perceptual learning in our participant sample, while attention-switching control did: listeners with poorer attention-switching control showed a stronger perceptual learning effect. We postulate that listeners with better attention-switching control may, in general, rely more strongly on bottom-up acoustic information compared to listeners with poorer attention-switching control, making them in turn less susceptible to lexically guided perceptual learning. Our results, moreover, clearly show that lexically guided perceptual learning is not lost when acoustic processing is less accurate.
In spontaneous, conversational speech, words are often reduced compared to their citation forms, such that a word like yesterday may sound like ['jESeI]. The present paper investigates such acoustic reduction . The study of reduction needs large corpora that are transcribed phonetically. The first part of this paper describes an automatic transcription procedure used to obtain such a large phonetically transcribed corpus of Dutch spontaneous dialogues, which is subsequently used for the investigation of acoustic reduction. First, the orthographic transcription were adapted for automatic processing. Next, the phonetic transcription of the corpus was created by means of a forced alignment with a lexicon with multiple pronunciation variants per word. These variants were generated by applying phonological and reduction rules to the canonical phonetic transcriptions of the words. The second part of this paper reports the results of a quantitative analysis of reduction in the corpus on the basis of the generated transcriptions and gives an inventory of segmental reductions in standard Dutch. Overall, we found that reduction is more pervasive in spontaneous Dutch than previously documented.
The fields of human speech recognition (HSR) and automatic speech recognition (ASR) both investigate parts of the speech recognition process and have word recognition as their central issue. Although the research fields appear closely related, their aims and research methods are quite different. Despite these differences there is, however, lately a growing interest in possible cross-fertilisation. Researchers from both ASR and HSR are realising the potential benefit of looking at the research field on the other side of the ‘gap’. In this paper, we provide an overview of past and present efforts to link human and automatic speech recognition research and present an overview of the literature describing the performance difference between machines and human listeners. The focus of the paper is on the mutual benefits to be derived from establishing closer collaborations and knowledge interchange between ASR and HSR. The paper ends with an argument for more and closer collaborations between researchers of ASR and HSR to further improve research in both fields
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.