We describe the acoustic-prosodic and syntactic-prosodic annotation and classification of boundaries, accents, and sentence mood integrated in the Verbmobil system for the three languages German, English, and Japanese. For the acoustic-prosodic classification, a large feature vector of normalized prosodic features is used. For the three languages, a multilingual prosody module was developed that reduces memory requirements considerably compared to three monolingual modules. For classification, neural networks and statistical language models are used.
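As an illustration of this kind of pipeline, the sketch below normalizes a small prosodic feature vector and passes it through a one-hidden-layer network to obtain boundary posteriors. The feature dimensions, weights, and function names are hypothetical placeholders, not the Verbmobil module's actual configuration; in the real system the network output would additionally be combined with a statistical language model score.

```python
# Minimal sketch of acoustic-prosodic boundary classification.
# All sizes and weights are illustrative; the Verbmobil module uses a
# much larger normalized feature vector and trained networks.
import numpy as np

def normalize(features, means, stds):
    """Z-score normalization of raw prosodic features (e.g., F0, energy,
    duration) to compensate for speaker and channel variation."""
    return (features - means) / stds

def mlp_posteriors(x, w1, b1, w2, b2):
    """One-hidden-layer feed-forward network returning class posteriors,
    e.g., P(no boundary) and P(boundary) for the current word."""
    h = np.tanh(x @ w1 + b1)            # hidden layer
    z = h @ w2 + b2                     # output-layer logits
    e = np.exp(z - z.max())             # numerically stable softmax
    return e / e.sum()

# Toy example: 6 prosodic features around one candidate boundary.
rng = np.random.default_rng(0)
x = normalize(rng.normal(size=6), means=np.zeros(6), stds=np.ones(6))
w1, b1 = rng.normal(size=(6, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 2)), np.zeros(2)
print(mlp_posteriors(x, w1, b1, w2, b2))  # [P(no boundary), P(boundary)]
```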
In multimodal dialogue systems, several input and output modalities are used for user interaction. The most important modality for human-computer interaction is speech. As in human-human interaction, it is necessary that the machine recognize the spoken word sequence in the user's utterance. For better communication with the user, it is also advantageous to recognize the user's internal emotional state, because the dialogue strategy can then be adapted to the situation in order to reduce, for example, the user's anger or uncertainty. In the following sections we first describe the state of the art in emotion and user state recognition with the help of prosody. The next section describes the prosody module. After that we present the experiments and results for the recognition of user states. We summarize our results in the last section.
Summary. Wizard-of-Oz (WOZ) data have several characteristics that make them less than ideal for user state classification, such as the nonuniform distribution of emotions across utterances and the uneven distribution of emotional expression across speech, facial expression, and gesture. In particular, the fact that most of the data collected in the WOZ experiments contain no emotional expression at all makes it difficult to obtain enough representative data for training the classifiers. Because of this problem we collected our own database. These data are also relevant for several demonstration sessions, in which the functionality of the SMARTKOM system is shown in accordance with the defined use cases. In the following we first describe the system environment for data collection and then the collected data. At the end we discuss the tool used to demonstrate user states detected in the different modalities.

Database with Acted User States
Because of the lack of training data, we decided to build our own database and to collect uniformly distributed data containing emotional expression of user state in all three handled modalities: speech, gesture, and facial expression (see Streit et al. (2006); for an online demonstration refer to our website). We recorded instructed subjects who were asked to express four user states. Because SMARTKOM is a demonstration system, it is sufficient to use acted data for the training database. For our study we collected data from 63 naive subjects (41 male, 22 female). They were instructed to act as if they had asked the SMARTKOM system for the TV program and felt content, unsatisfied, helpless, or neutral about the system's feedback. Different genres such as news, daily soaps, and science reports were projected onto the display for selection. The subjects were prompted with an utterance displayed on the screen and were then asked to express their internal state through voice and gesture and, at the same time, through facial expression.
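To make the collection setup concrete, here is a hypothetical sketch of how one recording's metadata might be represented and the label distribution checked; the Recording class and its field names are illustrative assumptions, not the SMARTKOM data format.

```python
# Sketch of a metadata record for the acted user-state database described
# above. Field names and values are illustrative only.
from dataclasses import dataclass
from collections import Counter

USER_STATES = ("content", "unsatisfied", "helpless", "neutral")

@dataclass
class Recording:
    subject_id: int      # 1..63 (41 male, 22 female)
    gender: str          # "m" or "f"
    user_state: str      # one of USER_STATES
    modalities: tuple    # e.g., ("speech", "gesture", "face")
    prompt: str          # utterance displayed on the screen

recordings = [
    Recording(1, "m", "helpless", ("speech", "gesture", "face"),
              "Show me the TV program for tonight."),
]

# A uniform distribution over the four states is the collection goal.
print(Counter(r.user_state for r in recordings))
```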
Abstract. A new direction in improving automatic dialogue systems is to make human-machine dialogue more similar to human-human dialogue. A modern system should be able to recognize the semantic content of spoken utterances, but also to interpret paralinguistic or non-verbal information, as indicators of the internal user state, in order to detect success or trouble in communication. A common problem in human-machine dialogue where information about a user's internal state of mind may give a clue is, for instance, the recurrent misunderstanding of the user by the system. This can be prevented if we detect the anger in the user's voice. In contrast to anger, a joyful face combined with a pleased voice may indicate a satisfied user who wants to go on with the current dialogue behavior, while a hesitant searching gesture reveals the user's uncertainty. This paper explores the possibility of recognizing a user's internal state by combining, in parallel, facial expression classification with eigenfaces, a prosodic classifier based on artificial neural networks, and a discrete hidden Markov model (HMM) for gesture analysis. Our experiments show that all three input modalities can be used to identify a user's internal state. However, a user state is not always indicated by all three modalities at the same time; thus a fusion of the different modalities seems necessary. Different ways of modality fusion are discussed.
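As a hedged illustration of one such fusion strategy, the sketch below combines per-modality class posteriors (eigenface face classifier, ANN prosody classifier, HMM gesture classifier) with a weighted log-linear rule. The weights, posterior values, and the fuse helper are assumptions for illustration; the paper discusses several fusion strategies rather than prescribing this one.

```python
# Minimal sketch of late (decision-level) fusion of three modality
# classifiers over the four user states. Numbers are illustrative.
import numpy as np

STATES = ("content", "unsatisfied", "helpless", "neutral")

def fuse(posteriors, weights):
    """Weighted log-linear combination of per-modality class posteriors.
    posteriors: one array per modality, each summing to 1."""
    log_p = sum(w * np.log(p + 1e-12) for p, w in zip(posteriors, weights))
    p = np.exp(log_p - log_p.max())    # renormalize in a stable way
    return p / p.sum()

face    = np.array([0.10, 0.60, 0.20, 0.10])  # eigenface classifier
prosody = np.array([0.15, 0.55, 0.15, 0.15])  # ANN on prosodic features
gesture = np.array([0.25, 0.25, 0.35, 0.15])  # discrete HMM on gesture

p = fuse([face, prosody, gesture], weights=[1.0, 1.0, 0.5])
print(dict(zip(STATES, np.round(p, 3))))       # fused user-state estimate
```

Down-weighting a modality (here, gesture) is one simple way to handle the observation that a user state is not always expressed in all three channels at once.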
Summary. Humans often make conscious and unconscious gestures, which reflect their state of mind, their thoughts, and the way these are formulated. These inherently complex processes cannot, in general, be substituted by a corresponding verbal utterance with the same semantics (McNeill, 1992). Gesture, a kind of body language, carries important information about the intention and state of the gesture producer. It is therefore an important communication channel in human-computer interaction. In the following we first describe the state of the art in gesture recognition. The next section describes the gesture interpretation module. After that we present the experiments and results for the recognition of user states. We summarize our results in the last section.