Purpose: Conversational entrainment, the phenomenon whereby communication partners synchronize their behavior, is considered essential for productive and fulfilling conversation. Lack of entrainment could, therefore, negatively impact conversational success. Although studied in many disciplines, entrainment has received limited attention in the field of speech-language pathology, where its implications may have direct clinical relevance. Method: A novel computational methodology, informed by expert clinical assessment of conversation, was developed to investigate conversational entrainment across multiple speech dimensions in a corpus of experimentally elicited conversations involving healthy participants. The predictive relationship between the methodology output and an objective measure of conversational success, communicative efficiency, was then examined. Results: Using a real versus sham validation procedure, we find evidence of sustained entrainment in rhythmic, articulatory, and phonatory dimensions of speech. We further validate the methodology, showing that models built on speech signal entrainment measures consistently outperform models built on nonentrained speech signal measures in predicting communicative efficiency of the conversations. Conclusions: A multidimensional, clinically meaningful methodology for capturing conversational entrainment, validated in healthy populations, has implications for disciplines such as speech-language pathology where conversational entrainment represents a critical knowledge gap in the field, as well as a potential target for remediation.
In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.
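As a rough illustration of the triplet loss mentioned in the abstract above, the following minimal sketch computes the standard hinge-style loss for a single triplet of embeddings. It is not the paper's model: the function names, the Euclidean distance, and the margin value are assumptions chosen for illustration, and the attention-based embedding network itself is not shown.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the anchor is closer to the positive (same speaker)
    than to the negative (different speaker) by at least `margin`.
    The margin of 1.0 is an assumed default, not a value from the paper.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings: the negative is already far from the anchor.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Swapping positive and negative violates the margin and yields a positive loss.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Some formulations use squared distances or cosine similarity instead; the choice here is only one common variant.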
Acoustic-prosodic entrainment describes the tendency of humans to align or adapt their speech acoustics to each other in conversation. This alignment of spoken behavior has important implications for conversational success. However, modeling the subtle nature of entrainment in spoken dialogue continues to pose a challenge. In this paper, we propose a straightforward definition for local entrainment in the speech domain and operationalize an algorithm based on this: acoustic-prosodic features that capture entrainment should be maximally different between real conversations involving two partners and sham conversations generated by randomly mixing the speaking turns from the original two conversational partners. We propose an approach for measuring local entrainment that quantifies alignment of behavior on a turn-by-turn basis, projecting the differences between interlocutors' acoustic-prosodic features for a given turn onto a discriminative feature subspace that maximizes the difference between real and sham conversations. We evaluate the method using the derived features to drive a classifier aiming to predict an objective measure of conversational success (i.e., low versus high), on a corpus of task-oriented conversations. The proposed entrainment approach achieves 72% classification accuracy using a Naive Bayes classifier, outperforming three previously established approaches evaluated on the same conversational corpus.
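The real-versus-sham idea in the abstract above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' algorithm: it uses a single hypothetical acoustic feature per turn, builds sham conversations by shuffling only one speaker's turns, and compares mean turn-by-turn differences directly. The projection onto a discriminative feature subspace and the classifier are omitted, and all function names are hypothetical.

```python
import random


def turn_differences(speaker_a, speaker_b):
    """Absolute difference of one feature between paired turns of two speakers."""
    return [abs(a - b) for a, b in zip(speaker_a, speaker_b)]


def mean(values):
    return sum(values) / len(values)


def make_sham(speaker_a, speaker_b, rng):
    """Build a sham conversation by shuffling speaker B's turns (an assumed
    simplification), so paired turns no longer reflect the real exchange."""
    shuffled_b = list(speaker_b)
    rng.shuffle(shuffled_b)
    return list(speaker_a), shuffled_b


def real_vs_sham_gap(speaker_a, speaker_b, n_shams=200, seed=0):
    """Return (real mean difference, average sham mean difference)."""
    rng = random.Random(seed)
    real = mean(turn_differences(speaker_a, speaker_b))
    sham_means = []
    for _ in range(n_shams):
        sham_a, sham_b = make_sham(speaker_a, speaker_b, rng)
        sham_means.append(mean(turn_differences(sham_a, sham_b)))
    return real, mean(sham_means)


# Hypothetical per-turn pitch values (Hz): speaker B tracks speaker A closely.
pitch_a = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
pitch_b = [101, 111, 121, 131, 141, 151, 161, 171, 181, 191]
real_gap, sham_gap = real_vs_sham_gap(pitch_a, pitch_b)
```

In this toy data the real conversation shows a smaller turn-by-turn difference than the shams, which is the kind of contrast the abstract describes features being selected to maximize.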
Purpose: Coordination of communicative behavior supports shared understanding in conversation. The current study brings together analysis of two speech coordination strategies, entrainment and compensation of articulation, in a preliminary investigation into whether strategy organization is shaped by a challenging communicative context: conversing with a person who has a communication disorder. Method: As an initial clinical test case, an automated measure of articulatory precision was analyzed in a corpus of spoken dialogue, where a confederate conversed with participants with traumatic brain injury (n = 28) and participants with no brain injury (n = 48). Results: Overall, the confederate engaged in significant entrainment and high compensation (hyperarticulation) in conversations with participants with traumatic brain injury relative to significant entrainment and low compensation (hypoarticulation) in conversations with participants with no brain injury. Furthermore, the confederate's articulatory precision changed over the course of the conversations. Conclusions: Findings suggest that the organization of conversational coordination is sensitive to context, supporting synergistic models of spoken dialogue. While corpus limitations are acknowledged, these initial results point to differences in the way in which speech strategies are realized in challenging communicative contexts, highlighting a viable and important target for investigation with clinical populations. A framework for investigating speech coordination strategies in tandem and ideas for advancing this line of inquiry serve as key contributions of this work.
The communication phenomenon known as conversational entrainment occurs when dialogue partners align or adapt their behavior to one another while conversing. Associated with rapport, trust, and communicative efficiency, entrainment appears to facilitate conversational success. In this work, we explore how conversational partners entrain or align on articulatory precision or the clarity with which speakers articulate their spoken productions. Articulatory precision also has implications for conversational success as precise articulation can enhance speech understanding and intelligibility. However, in conversational speech, speakers tend to reduce their articulatory precision, preferring low-cost, imprecise speech. Speakers may adapt their articulation and become more precise depending on feedback from their listeners. Given the potential of entrainment, we are interested in how conversational partners adapt or entrain their articulatory precision to one another. We explore this phenomenon in 57 task-based dialogues. Controlling for the influence of speaking rate, we find that speakers entrain on articulatory precision, with significant alignment on articulation of consonants. We discuss the potential applications that speaker alignment on precision might have for modeling conversation and implementing strategies for enhancing communicative success in human-human and human-computer interactions.
Purpose: For individuals with Parkinson's disease (PD), conversational interactions can be challenging. Efforts to improve the success of these interactions have largely fallen on the individual with PD. Successful communication, however, involves contributions from both the individual with PD and their communication partner. The current study examines whether healthy communication partners naturally engage in different acoustic–prosodic behavior (speech compensations) when conversing with an individual with PD and, further, whether such behavior aids communication success. Method: Measures of articulatory precision, speaking rate, and pitch variability were extracted from the speech of healthy speakers engaged in goal-directed dialogue with other healthy speakers (healthy–healthy dyads) and with individuals with PD (healthy–PD dyads). Speech compensations, operationally defined as significant differences in healthy speakers' acoustic–prosodic behavior in healthy–healthy dyads versus healthy–PD dyads, were calculated for the three speech behaviors. Finally, the relationships between speech behaviors and an objective measure of communicative efficiency were examined. Results: Healthy speakers engaged in speech characterized by greater articulatory precision and slower speaking rate when conversing with individuals with PD relative to conversations with other healthy individuals. However, these adaptive speech compensations were not predictive of communicative efficiency. Conclusions: Evidence that healthy speakers naturally engage in speech compensations when conversing with individuals with PD is novel, yet consistent with findings from studies with other populations in which conversation can be challenging. In the case of PD, these compensatory behaviors did not support communication outcomes. While preliminary in nature, the results raise important questions regarding the speech behavior of healthy communication partners and provide directions for future work.
Previous research on stop consonants found that less than 60 percent of the stops sampled from a speech corpus contained a clearly defined period of silence or prevoicing prior to the plosive release [Crystal & House, JASA, 1988]. How listeners perceive a reduced form of stop consonants without these cues is not well understood. The purpose of this experiment was to investigate whether recasting typical formant transitions into a measure called a “relative formant deflection pattern” provides a means of predicting listeners’ perceptions of approximant-like, voiced stop consonant variants. A computational model of speech production, in which consonant constriction location was varied along the length of the vocal tract, was used to generate place continua of approximant-like, voiced stop consonants imposed on a vowel-to-vowel transition. Stimuli were presented to listeners in three conditions: 1) normal simulated speech, 2) sinewave speech in which three tones replicated the time course of the F1, F2, and F3 contours in the simulated samples, and 3) sinewave speech in which three tones were present, but selected combinations of F1, F2, and F3 were set to a flat contour. Perceptual responses will be compared to the predictions based on relative formant deflection patterns across conditions.