Speech research during recent years has moved progressively away from its traditional focus on audition toward a more multisensory approach. In addition to audition and vision, somatosensory modalities, including proprioception, pressure, vibration, and aerotactile sensation, are highly relevant for experiencing and/or conveying speech. In this article, we review both long-standing cross-modal effects stemming from decades of audiovisual speech research and new findings related to somatosensory effects. Cross-modal effects in speech perception have to date been found to be constrained by temporal congruence and signal relevance, but appear to be unconstrained by spatial congruence. The literature reveals that, far from taking place in a one-, two-, or even three-dimensional space, speech occupies a highly multidimensional sensory space. We argue that future research on cross-modal effects should expand to consider each of these modalities in speech, both separately and in combination.
Human bodies exhibit lateral biases between many laterally symmetrical body parts (e.g., hands, feet, eyes, and ears) that increase behavioural efficiency and functionality. We report on lateral biases observed in the tongue during speech. The tongue is bilaterally braced against the back teeth and hard palate throughout speech; bracing is interrupted for the production of some laterals and occasional low vowels. Some evidence suggests that the movement away from the braced posture may be produced by lowering one side of the tongue first and that the leading side is consistent within a speaker [Gick et al. 2017, JSLHR 60(3), 494]. We report findings on lateral bias in English speakers, on its correlation with speakers' other lateral biases, and on what this may imply about the origins of the bias. Preliminary results indicate some variation, together with a population-level bias (a preference for one side over the other), suggesting that the bias may develop with cortical modulation in much the same way that handedness is thought to arise. [Funding through NSERC.]
Listeners incorporate visual speech information produced by computer-simulated faces when the articulations are precise and pre-programmed [e.g., Cohen & Massaro 1990, Behav. Res. Meth. Instr. Comp. 22(2), 260–263]. Advances in virtual reality (VR) and avatar technologies have created new platforms for face-to-face communication in which visual speech information is presented through avatars. The avatars' articulatory movements may be generated in real time as an algorithmic response to acoustic parameters. While the communicative experience in VR has become increasingly realistic, the visual speech articulations remain intentionally imperfect and focused on synchrony to avoid uncanny-valley effects [https://developers.facebook.com/videos/f8-2017/the-making-of-facebook-spaces/]. Depending on the VR platform, vowel rounding may be represented reasonably faithfully, while mouth opening size may convey gross variation in amplitude. It is unknown whether and how perceivers make use of such underspecified and at times misleading visual cues to speech. The current study investigates whether reliable segmental information can be extracted from visual speech algorithmically generated by a popular VR platform. We report an experiment using a speech-in-noise task with audiovisual stimuli in two conditions (with and without articulatory movement) to determine whether the visual information improves or degrades identification.
Previous research has shown that the sensation of airflow causes bilabial stop closures to be perceived as aspirated even when paired with silent articulations rather than an acoustic signal [Bicevskis et al. 2016, JASA 140(5): 3531–3539]. However, some evidence suggests that perceivers integrate this cue differently when the silent articulations come from an animated face [Keough et al. 2017, Canadian Acoustics 45(3): 176–177] rather than a human one: in that study, participants shifted from a strong initial /ba/ bias to a strong /pa/ bias by the second half of the experiment, suggesting that they learned to associate the video with the aspirated articulation through experience with the airflow. One explanation for these findings is methodological: participants saw a single video clip, whereas previous work exposed participants to multiple videos. The current study reports two experiments using a single clip of a human face (originally from Bicevskis et al. 2016). We found no evidence of a bias shift, indicating that the findings reported by Keough et al. are not attributable to the use of a single video. Instead, our findings suggest that aerotactile cues shift consonant perception regardless of the number of recordings presented, as long as the speaking face is human.
Listeners integrate information from simulated faces in multimodal perception [Cohen & Massaro 1990, Behav. Res. Meth. Instr. Comp. 22(2), 260–263], but not always in the same way as from real faces [Keough et al. 2017, Can. Acoust. 45(3): 176–177]. This is increasingly relevant given the dramatic growth of avatar communication in virtual spaces [https://www.bloomberg.com/professional/blog/computings-next-big-thing-virtual-world-may-reality-2020/]. Prosody is especially relevant because, compared to segmental speech sounds, the visual factors indicating prosodic prominence (e.g., eyebrow raises and hand gestures) frequently bear no biomechanical relation to the production of the acoustic features of prominence, yet are nonetheless highly reliable [Krahmer & Swerts 2007, JML 57(3): 396–414]. Avatar-based virtual communication systems may likewise convey prosodic information through unnatural means, e.g., by expressing amplitude via oral aperture (louder sound = larger opening). The present study examines whether this unnatural but reliable indicator of speech amplitude is integrated in prominence perception; we report an experiment testing whether and how perceivers take this visual information into account when detecting prosodic prominence. Preliminary evidence suggests that larger oral aperture increases perceived prominence, with differences by sentence position.
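As a purely illustrative sketch (not the algorithm of any particular VR platform; the function name mouth_aperture, the 20-ms frame length, and the max_rms scaling are assumptions introduced here for illustration), an amplitude-to-aperture mapping of the kind described above could be drafted in Python along the following lines, driving a mouth-opening parameter from the short-term RMS amplitude of the audio:

    # Hypothetical sketch: map short-term audio amplitude to an avatar
    # "mouth aperture" value in [0, 1]. Illustration only; not the
    # algorithm used by any specific VR platform.
    import numpy as np

    def mouth_aperture(signal, sample_rate, frame_ms=20.0, max_rms=0.3):
        """Return one aperture value per analysis frame.

        signal      : 1-D array of audio samples in [-1, 1]
        sample_rate : samples per second
        frame_ms    : analysis window length in milliseconds (assumed)
        max_rms     : RMS level mapped to a fully open mouth (assumed)
        """
        frame_len = max(1, int(sample_rate * frame_ms / 1000))
        n_frames = len(signal) // frame_len
        apertures = np.empty(n_frames)
        for i in range(n_frames):
            frame = signal[i * frame_len:(i + 1) * frame_len]
            rms = np.sqrt(np.mean(frame ** 2))        # short-term amplitude
            apertures[i] = min(rms / max_rms, 1.0)    # louder sound = larger opening
        return apertures

    # Example: a 1-s amplitude-modulated tone yields apertures that rise and
    # fall with loudness, independent of the segmental content of the signal.
    sr = 16000
    t = np.arange(sr) / sr
    tone = 0.3 * np.sin(2 * np.pi * 3 * t) ** 2 * np.sin(2 * np.pi * 150 * t)
    print(mouth_aperture(tone, sr)[:10])

Because the aperture in this sketch tracks only loudness, any two segments of equal amplitude would produce the same mouth opening, which is precisely the kind of reliable but unnatural visual cue whose role in prominence perception the present study examines.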