There is now a consensus that both auditory and motor representations intervene in the perceptual processing of speech units. However, the functional role of each of these systems remains seldom addressed and poorly understood. We capitalized on the formal framework of Bayesian Programming to develop COSMO (Communicating Objects using Sensory-Motor Operations), an integrative model that allows principled comparisons of purely motor and purely auditory implementations of a speech perception task and tests the gain in efficiency provided by their Bayesian fusion. Here, we show three main results: (a) In a set of precisely defined “perfect conditions,” auditory and motor theories of speech perception are indistinguishable; (b) When a learning process that mimics speech development is introduced into COSMO, it departs from these perfect conditions: auditory recognition then becomes more efficient than motor recognition for dealing with learned stimuli, while motor recognition is more efficient in adverse conditions. We interpret this result as a general “auditory-narrowband versus motor-wideband” property; and (c) Simulations of plosive-vowel syllable recognition reveal possible cues from motor recognition for the invariant specification of the place of plosive articulation in context, cues that are lacking in the auditory pathway. This provides COSMO with a second property, in which auditory cues would be more efficient for vowel decoding and motor cues for decoding the place of plosive articulation. These simulations yield several predictions that are in good agreement with experimental data and suggest a natural complementarity between auditory and motor processing within a perceptuo-motor theory of speech perception.
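As a concrete illustration of the comparison described in this abstract, the sketch below contrasts a purely auditory decoder P(O | S), a purely motor decoder that marginalizes over motor commands M, and their Bayesian fusion. All distributions are toy, randomly generated tables, not the published COSMO parameters, and the fusion rule shown (a normalized product of the two posteriors under a flat prior) is one simple assumed form.

```python
import numpy as np

# Toy discrete spaces: objects O, motor commands M, sensory inputs S.
rng = np.random.default_rng(0)
n_O, n_M, n_S = 3, 5, 4

# Hypothetical conditional probability tables "learned" by the agent.
P_S_given_O = rng.dirichlet(np.ones(n_S), size=n_O)   # auditory branch  P(S | O)
P_M_given_O = rng.dirichlet(np.ones(n_M), size=n_O)   # motor repertoire P(M | O)
P_S_given_M = rng.dirichlet(np.ones(n_S), size=n_M)   # forward model    P(S | M)
P_O = np.ones(n_O) / n_O                               # flat prior over objects

def auditory_decode(s):
    """P(O | S=s) using the direct auditory branch."""
    post = P_S_given_O[:, s] * P_O
    return post / post.sum()

def motor_decode(s):
    """P(O | S=s) obtained by summing over motor commands M."""
    post = (P_M_given_O @ P_S_given_M[:, s]) * P_O
    return post / post.sum()

def fusion_decode(s):
    """Bayesian fusion: product of the two posteriors, renormalized."""
    post = auditory_decode(s) * motor_decode(s) / P_O
    return post / post.sum()

s = 2
print(auditory_decode(s), motor_decode(s), fusion_decode(s))
```

Under the “perfect conditions” mentioned in point (a), the auditory table would coincide with the motor marginalization over M, making the two decoders identical; the toy tables above are deliberately independent, so the two branches differ and their fusion carries extra information.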
Although speakers of a given language share the same phoneme representations, their productions can differ. We propose to investigate the development of these differences in production, called idiosyncrasies, using a Bayesian model of communication. Supposing that idiosyncrasies appear during the development of the motor system, we present two versions of the motor learning phase, both based on the guidance of a master agent: a "repetition model," in which agents try to imitate the sounds produced by the master, and a "communication model," in which agents try to replicate the phonemes produced by the master. Our experimental results show that only the "communication model" gives rise to production idiosyncrasies, suggesting that idiosyncrasies are a natural outcome of a motor learning process based on a communicative goal.
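A minimal one-dimensional sketch of the two learning schemes, under assumed toy mappings (the learner's articulatory-to-acoustic map, the category half-width and the master's prototypes are all invented for illustration): repetition drives the learner toward the master's exact sounds, whereas a communicative criterion only requires the produced sound to fall inside the right phoneme category, leaving room for idiosyncratic prototypes.

```python
import numpy as np

rng = np.random.default_rng(1)
master_prototypes = np.array([-2.0, 0.0, 2.0])  # master's acoustic prototypes
category_halfwidth = 0.8                         # region judged as the phoneme

def learner_map(m):
    # hypothetical articulatory-to-acoustic map of the learner (biased vocal tract)
    return 1.1 * m + 0.4

def repetition_learning(o, n_trials=200):
    """Pick the motor command whose sound is closest to the master's sound."""
    candidates = rng.uniform(-4, 4, n_trials)
    errors = np.abs(learner_map(candidates) - master_prototypes[o])
    return candidates[np.argmin(errors)]

def communication_learning(o, n_trials=200):
    """Pick any motor command whose sound falls inside phoneme category o."""
    candidates = rng.uniform(-4, 4, n_trials)
    ok = np.abs(learner_map(candidates) - master_prototypes[o]) < category_halfwidth
    return rng.choice(candidates[ok])            # free choice inside the category

for o in range(3):
    m_rep, m_com = repetition_learning(o), communication_learning(o)
    print(o, learner_map(m_rep), learner_map(m_com))
# Repetition converges to the master's exact prototypes; communication settles
# anywhere inside the category, i.e. produces idiosyncratic prototypes.
```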
The existence of a functional relationship between the speech perception and production systems is now widely accepted, but the exact nature and role of this relationship remain quite unclear. The existence of idiosyncrasies in production and in perception sheds interesting light on the nature of this link. Indeed, a number of studies explore inter-individual variability in auditory and motor prototypes within a given language and provide evidence for a link between the two sets. In this paper, we attempt to simulate one study on coupled idiosyncrasies in the perception and production of French oral vowels within COSMO, a Bayesian computational model of speech communication. First, we show that if the learning process in COSMO includes a communicative mechanism between a Learning Agent and a Master Agent, vowel production does display idiosyncrasies. Second, we implement within COSMO three models of speech perception that are, respectively, auditory, motor and perceptuo-motor. We show that no idiosyncrasy in perception can be obtained in the auditory model, since it is optimally tuned to the learning environment, which does not include the motor variability of the Learning Agent. By contrast, the motor and perceptuo-motor models produce perception idiosyncrasies correlated with the idiosyncrasies in production. We draw conclusions about the role and importance of motor processes in speech perception, and propose a perceptuo-motor model in which auditory processing would enable optimal processing of learned sounds while motor processing would be helpful in unlearned, adverse conditions.
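The perception-side effect can be illustrated with two Gaussian vowel categories on a single formant axis; the means, variance and shift used below are purely hypothetical. An auditory decoder fit to the environment places its boundary at the environment midpoint, while a motor decoder fit to the agent's own shifted productions places it elsewhere, so perception idiosyncrasies track production idiosyncrasies.

```python
import numpy as np
from scipy.stats import norm

env_means   = np.array([-1.0, 1.0])               # master's two vowel categories
agent_means = env_means + np.array([0.3, -0.2])   # agent's idiosyncratic productions
sigma = 0.5

def decode(s, means):
    """Posterior over the two categories, assuming equal priors and variances."""
    like = norm.pdf(s, loc=means, scale=sigma)
    return like / like.sum()

s = 0.1
print("auditory:", decode(s, env_means))    # boundary at the environment midpoint
print("motor:   ", decode(s, agent_means))  # boundary shifted with the agent's own productions
```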
The contribution by M.A. Arbib over the years, as summarized and conceptualized in this paper [1], is admirable, extremely impressive, and very convincing in many respects. A key value of this work is that it systematically attempts to introduce formal conceptualization and modeling into the reasoning about facts and interpretations.

We would like to focus on a component of language, actually minor in Arbib's chapter, that is phonology, in light of a model of speech communication developed by our group. If the mirror system paved the way for a "language-ready brain", this should include a "phonology-ready brain". Among the seven properties of "language-readiness" listed in Section 1.6, two should provide the basis for crucial aspects of a phonological system. Firstly, property (1), "Complex action recognition", involves both action analysis, necessary for learning the components of a complex vocal action and decomposing it into phonological segments, and action chunking, making it possible to utter phonological sequences likely to convey meaning. Secondly, property (2), "parity", ensures that the communicative value of a phonological unit plays the same role for the speaker and the listener.

Let us begin with parity. In Section 4.1, Arbib insists that the mirror system does not correspond to a motor theory of speech perception. He proposes that vocal actions (speech utterances) can be recognized and understood by "general mechanisms which need not involve the mirror system strongly", and that the mirror system would just complement such general mechanisms, which is actually the basis of the model elaborated by Moulin-Frier and Arbib [3] and presented in the paper. This is where our computational studies could shed some more light on when general mechanisms for recognizing phonemes could suffice, and when the mirror system could be useful.
The three trends about the invariants of speech perception:
- Auditory theories: the invariants are acoustic properties.
- Motor theories: the invariants are the intended phonetic gestures of the speaker.
- Perceptuo-motor theories: the invariants are perceptuo-motor units characterized by both their articulatory coherence and their perceptual value.
Although sensorimotor exploration is a basic process in child development, a clear view of the underlying computational processes is still lacking. We propose to compare eight algorithms for sensorimotor exploration, based on three components: "accommodation," which performs a compromise between goal babbling and social guidance by a master; "local extrapolation," which simulates local exploration of the sensorimotor space to achieve motor generalizations; and "idiosyncratic babbling," which favors already explored motor commands when they are efficient. We show that a mix of these three components offers a good compromise, enabling efficient learning while reducing exploration as much as possible.
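One way to read "eight algorithms from three components" is as the 2^3 on/off combinations of those components. The sketch below shows how such combinations could compose in a single proposal step; the toy agent (1-D motor and sensory spaces, tanh forward model) and every interface on it are hypothetical and only meant to show how the components fit together, not how the published algorithms are implemented.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

class ToyAgent:
    def __init__(self):
        self.memory = {}                     # goal -> best motor command found so far
    def forward(self, m):                    # hypothetical articulatory-to-acoustic map
        return np.tanh(m)
    def sample_motor(self):
        return rng.uniform(-3, 3)
    def master_demonstration(self, goal):    # social guidance: a command reaching the goal
        return np.arctanh(np.clip(goal, -0.99, 0.99))
    def error(self, m, goal):
        return abs(self.forward(m) - goal)

def propose(agent, goal, accommodation, local_extrapolation, idiosyncratic):
    m = agent.sample_motor()                                     # plain motor babbling
    if idiosyncratic and goal in agent.memory:
        m = agent.memory[goal]                                   # reuse an efficient known command
    if local_extrapolation:
        cands = m + 0.1 * rng.standard_normal(20)                # local exploration around m
        m = min(cands, key=lambda c: agent.error(c, goal))
    if accommodation:
        m = 0.5 * m + 0.5 * agent.master_demonstration(goal)     # compromise with master guidance
    return m

algorithms = list(itertools.product([False, True], repeat=3))    # the eight on/off combinations
agent, goal = ToyAgent(), 0.5
for acc, ext, idio in algorithms:
    m = propose(agent, goal, acc, ext, idio)
    if goal not in agent.memory or agent.error(m, goal) < agent.error(agent.memory[goal], goal):
        agent.memory[goal] = m               # keep the best command found for this goal
```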
During speech development, babies learn to perceive and produce speech units, especially syllables and phonemes. However, the mechanisms underlying the acquisition of speech units remain unclear. We propose a Bayesian model of speech communication, named "COSMO SylPhon", for studying the acquisition of both syllables and phonemes. In this model, speech development involves a sensory learning phase, mainly concerned with the development of perception, and a motor learning phase, mainly concerned with the development of production. We study how an agent can learn speech units during these two phases through an unsupervised learning process based on syllable stimuli. We show that this learning process enables the agent to efficiently learn the distribution of syllabic stimuli provided by the environment. Importantly, we show that if agents are equipped with a bootstrap process inspired by the Frame-Content Theory of speech development, they learn to associate consonants with specific articulatory gestures, providing a basis for consonantal articulatory invariance.
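A rough sketch of the two phases under assumed representations: syllable stimuli are placeholder vectors, the sensory phase fits an unsupervised mixture model to their distribution, and the motor phase retains motor commands whose predicted sound scores high under that learned model. The mixture model and the identity forward map are stand-ins for illustration and do not reproduce the actual COSMO SylPhon formulation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
env_syllables = rng.normal(size=(500, 6))      # placeholder syllable stimuli, not real data

# Sensory learning phase: learn the distribution of syllabic stimuli from the environment.
sensory_model = GaussianMixture(n_components=4, random_state=0).fit(env_syllables)

def forward_model(m):
    # hypothetical articulatory-to-acoustic map, identity for this sketch
    return m

# Motor learning phase: keep motor commands whose predicted sound is well explained
# by the learned sensory model.
candidates = rng.normal(size=(1000, 6))
scores = sensory_model.score_samples(forward_model(candidates))
retained = candidates[np.argsort(scores)[-50:]]
```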