2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
DOI: 10.1109/globalsip.2016.7906033

Active speaker detection in human machine multiparty dialogue using visual prosody information

Cited by 14 publications (21 citation statements) | References 16 publications
“…In these studies, it was shown that lip information in the speech section and in the time section immediately before speech is useful for improving the performance of ASD. These previous research results [30][31][32][33][34][36] support the validity of our approach for predicting the next speaker and the utterance interval using the mouth-opening pattern at the end of an utterance.…”
Section: Mouth-opening Movement and Speaking (supporting, confidence: 77%)
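
As a rough illustration of the cited idea, the sketch below extracts the mouth-opening pattern over the last second of an utterance, the window this citing work uses for next-speaker prediction. The frame rate, the landmark layout, and both helper names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

FPS = 30  # assumed video frame rate

def lip_aperture(landmarks):
    """Frame-wise vertical mouth opening.

    landmarks: (n_frames, 68, 2) array in the common iBUG 68-point
    layout, where indices 62 and 66 are the inner-lip midpoints.
    """
    return np.linalg.norm(landmarks[:, 62] - landmarks[:, 66], axis=1)

def end_of_utterance_pattern(landmarks, end_frame, window_s=1.0):
    """Mouth-opening time series over the final second of an utterance."""
    w = int(window_s * FPS)
    return lip_aperture(landmarks[max(0, end_frame - w):end_frame])
```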
“…To focus on lip information, Cutler et al. used image features of the mouth region together with the audio information [33]. Haider et al. used head movements in addition to lip information during speech [34]. They also showed that lip- and head-movement features from the one second before the start of speech are useful for improving the performance of ASD [35].…”
Section: Mouth-opening Movement and Speaking (mentioning, confidence: 99%)
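
In the same hedged spirit, a pre-speech window can pool both lip and head-movement statistics. The head-pose layout and the reuse of the lip_aperture helper from the sketch above are assumptions for illustration, not the authors' feature set.

```python
import numpy as np

FPS = 30  # assumed video frame rate

def pre_speech_features(landmarks, head_pose, onset_frame, window_s=1.0):
    """Pool lip- and head-movement statistics over the second before speech.

    head_pose: (n_frames, 3) array of (yaw, pitch, roll) angles in degrees,
    e.g. from a head-pose estimator; lip_aperture is the helper defined
    in the previous sketch.
    """
    w = int(window_s * FPS)
    lo = max(0, onset_frame - w)
    ap = lip_aperture(landmarks[lo:onset_frame])  # mouth-opening signal
    head_vel = np.abs(np.diff(head_pose[lo:onset_frame], axis=0))  # angular speed
    return np.array([ap.mean(), ap.std(), head_vel.mean(), head_vel.max()])
```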
“…This study continues the authors' past work [8,9], which demonstrated the use of lip and head movements during speech articulation for active speaker detection but did not assess the discriminative power of visual prosody captured just before and/or after articulation. In this study, we propose methods for detecting active speakers using visual prosody information from the one second before/after speech articulation, and we also evaluate the visual prosody of the first second of the speech utterance.…”
Section: Introduction (supporting, confidence: 78%)
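
A minimal end-to-end sketch of how such windowed visual-prosody features could drive an active-speaker classifier, using scikit-learn and synthetic stand-in data; this illustrates the pipeline shape only, not the paper's model or results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: one visual-prosody feature vector per
# (participant, one-second window), labeled 1 if that participant
# is the active speaker for the associated utterance.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```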
“…An audio-visual dataset [8,9] was collected in a task-free dialogue setting. Four participants (3 males and 1 female) converse with the "machine", but they are not allowed to speak with each other directly.…”
Section: Data Collection (mentioning, confidence: 99%)