“…For this, we used the trained Temporal-aware bI-direction Multiscale Network (TIM-Net) model (Ye et al., 2023), a state-of-the-art temporal emotional modeling approach, trained on six benchmark Speech Emotion Recognition (SER) datasets: the Chinese corpus CASIA, the German corpus EMODB, the Italian corpus EMOVO, and the English corpora IEMOCAP, RAVDESS, and SAVEE (Busso et al., 2008; Tao et al., 2008; Jackson and Haq, 2010; Costantini et al., 2014; Livingstone and Russo, 2018; Benlamine and Frasson, 2021).…”
Section: Discussion (citation type: mentioning; classification confidence: 99%)
“…To investigate this possibility, we cropped all participant audio speech signals for each labeled condition, then extracted speech spectral features from those data using MFCC features as input, in order to directly predict four salient emotion classes, i.e., "anger", "happy", "neutral", and "sad". For this, we used the trained Temporal-aware bI-direction Multiscale Network (TIM-Net) model (Ye et al., 2023), a state-of-the-art temporal emotional modeling approach, trained on six benchmark Speech Emotion Recognition (SER) datasets: the Chinese corpus CASIA, the German corpus EMODB, the Italian corpus EMOVO, and the English corpora IEMOCAP, RAVDESS, and SAVEE (Busso et al., 2008; Tao et al., 2008; Jackson and Haq, 2010; Costantini et al., 2014; Livingstone and Russo, 2018; Benlamine and Frasson, 2021).…”
Section: Speech Emotion Analysis (citation type: mentioning; classification confidence: 99%)
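The excerpt above describes a standard SER inference pipeline: crop each labeled utterance, extract MFCCs, and feed them to a pretrained classifier over the four target emotions. The sketch below illustrates that pipeline under stated assumptions: the checkpoint path tim_net.h5, the 39-coefficient MFCC configuration, and the fixed 300-frame input length are illustrative choices, not the authors' exact settings. TIM-Net's public release is Keras-based, so a Keras-style model load is assumed.

```python
# Minimal sketch of the described SER step: crop -> MFCCs -> pretrained model.
# Paths, feature dimensions, and input length are illustrative assumptions.
import numpy as np
import librosa
import tensorflow as tf

EMOTIONS = ["anger", "happy", "neutral", "sad"]

def mfcc_features(wav_path, sr=22050, n_mfcc=39, max_frames=300):
    """Load one cropped utterance and return a fixed-size MFCC matrix."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
    mfcc = mfcc[:, :max_frames]                             # truncate long clips
    if mfcc.shape[1] < max_frames:                          # zero-pad short clips
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    return mfcc.T  # (frames, n_mfcc), time-major as most SER models expect

# Hypothetical pretrained checkpoint (assumed Keras-style serialization).
model = tf.keras.models.load_model("tim_net.h5")

# Hypothetical file name for one participant/condition crop.
x = mfcc_features("participant_042_condition_C2.wav")[np.newaxis, ...]
probs = model.predict(x)[0]  # softmax scores over the four classes
print({e: float(p) for e, p in zip(EMOTIONS, probs)})
```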
“…In much of this work, there is an assumption that designing an engaged conversational HRI is critical. However, while some studies have examined emotion estimation in HRI as a means to promote user engagement (Busso et al., 2008; Celiktutan et al., 2017), there is relatively little research to date on comprehensive user engagement estimation in either situated conversational HRI or spontaneous conversation, e.g., Ben Youssef et al. (2017) and Benlamine and Frasson (2021).…”
Successful conversational interaction with a social robot requires not only an assessment of a user’s contribution to an interaction, but also awareness of their emotional and attitudinal states as the interaction unfolds. To this end, our research aims to systematically trigger, and then interpret, human behaviors that track different states of potential user confusion in interaction, so that systems can be primed to adjust their policies when users enter confusion states. In this paper, we present a detailed human-robot interaction study to prompt, investigate, and eventually detect confusion states in users. The study employs a Wizard-of-Oz (WoZ) style design with a Pepper robot to prompt confusion states in task-oriented dialogues in a well-defined manner. The data collected from 81 participants include audio and visual recordings, from both the robot’s perspective and the environment, as well as participant survey data. From these data, we evaluated the correlations of induced confusion conditions with multimodal features, including eye gaze estimation, head pose estimation, facial emotion detection, silence duration, and user speech analysis, including emotion and pitch analysis. The analysis shows significant differences in participants’ behaviors across states of confusion based on these signals, as well as a strong correlation between confusion conditions and participants’ own self-reported confusion scores. The paper establishes strong correlations between confusion levels and these observable features, and lays the groundwork for a more complete social- and affect-oriented strategy for task-oriented human-robot interaction. The contributions of this paper include the methodology applied, the dataset collected, and our systematic analysis.
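Two of the speech features named in the abstract, silence duration and pitch, along with their correlation against self-reported confusion scores, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the top_db silence threshold, the C2..C7 pitch bounds, the file names, and the example ratings are all hypothetical.

```python
# Minimal sketch: per-utterance silence duration and mean pitch, then a rank
# correlation against self-reported confusion scores. All parameters and data
# below are illustrative assumptions, not values reported in the paper.
import numpy as np
import librosa
from scipy.stats import spearmanr

def speech_signals(wav_path, top_db=30):
    y, sr = librosa.load(wav_path, sr=None)
    # Silence duration: total clip time minus the non-silent intervals.
    intervals = librosa.effects.split(y, top_db=top_db)
    voiced_time = sum(end - start for start, end in intervals) / sr
    silence = len(y) / sr - voiced_time
    # Mean fundamental frequency over voiced frames (pYIN pitch tracker).
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    mean_pitch = float(np.nanmean(f0[voiced_flag])) if voiced_flag.any() else 0.0
    return silence, mean_pitch

# Hypothetical per-participant files and illustrative self-reported ratings.
silences = [speech_signals(p)[0] for p in ["p01.wav", "p02.wav", "p03.wav"]]
confusion_scores = [2, 4, 5]
rho, pval = spearmanr(silences, confusion_scores)
print(f"Spearman rho={rho:.2f}, p={pval:.3f}")
```

Spearman's rank correlation is used here rather than Pearson's because self-reported confusion ratings are ordinal survey data.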