A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization

Otsuka, Kazuhiro; Araki, Shoko; Ishizuka, Kentaro; Fujimoto, Masakiyo; Heinrich, Martin; Yamato, Junji

doi:10.1145/1452392.1452446

Cited by 71 publications

(24 citation statements)

References 16 publications


“…These could be remote meeting participants in a teleconference situation or users of meeting archive systems ( [28], [29], [30]). Because of the real-time constraint the most challenging is the use of these technologies by remote participants in an ongoing meeting.…”

Section: Discussion (mentioning)

confidence: 99%

Supporting Engagement and Floor Control in Hybrid Meetings

Akker, Hofs, Hondorp, et al. 2009

Lecture Notes in Computer Science


Abstract. Remote participants in hybrid meetings often have problems to follow what is going on in the (physical) meeting room they are connected with. This paper describes a videoconferencing system for participation in hybrid meetings. The system has been developed as a research vehicle to see how technology based on automatic real-time recognition of conversational behavior in meetings can be used to improve engagement and floor control by remote participants. The system uses modules for online speech recognition, real-time visual focus of attention as well as a module that signals who is being addressed by the speaker. A built-in keyword spotter allows an automatic meeting assistant to call the remote participant's attention when a topic of interest is raised, pointing at the transcription of the fragment to help him catch-up.


“…The resolution is low and does not allow the analysis of fine details of participants' movements. In (Otsuka et al, 2008), two omnidirectional cameras with fish eye lenses are used. The system provides high resolution and 30 fps frame rate.…”

Section: The Emergent Leadership Synchronized Corpus (mentioning)

confidence: 99%

Emergent leaders through looking and speaking: from audio-visual data to multimodal recognition

Sanchez-Cortes, Aran, Kumar, et al. 2012

J Multimodal User Interfaces


In this paper we present a multimodal analysis of emergent leadership in small groups using audio-visual features and discuss our experience in designing and collecting a data corpus for this purpose. The ELEA AudioVisual Synchronized corpus (ELEA AVS) was collected using a light portable setup and contains recordings of small group meetings. The participants in each group performed the winter survival task and filled in questionnaires related to personality and several social concepts such as leadership and dominance. In addition, the corpus includes annotations on participants' performance in the survival task, and also annotations of social concepts from external viewers. Based on this corpus, we present the feasibility of predicting the emergent leader in small groups using automatically extracted audio and visual features, based on speaking turns and visual attention, and we focus specifically on multimodal features that make use of the looking at participants while speaking and looking at while not speaking measures. Our findings indicate that emergent leadership is related, but not equivalent, to dominance, and while multimodal features bring a moderate degree of effectiveness in inferring the leader, much simpler features extracted from the audio channel are found to give better performance.


“…The orchestration engine produces then an orchestrated video chat by choosing at each point in time the perspective that best represents the social interaction based on decision-level rule-based fusion. In this context, TA2 presents several challenges: the results need to be computed in real-time with low affordable delay from spatially separated sensors (as opposed to other systems, such as [5,6,7], relying on collocated sensors) in open, unconstrained environment. Furthermore, the results are supposed to be localised in the image space to allow for a dynamic and seamless orchestrated video chat.…”

Section: Introduction (mentioning)

confidence: 99%

Multimodal Cue Detection Engine for Orchestrated Entertainment

Korchagin, Duffner, Motlíček, et al. 2012

Lecture Notes in Computer Science


Abstract. In this paper, we describe a low delay real-time multimodal cue detection engine for a living room environment. The system is designed to be used in open, unconstrained environments to allow multiple people to enter, interact and leave the observable world with no constraints. It comprises detection and tracking of up to 4 faces, estimation of head poses and visual focus of attention, detection and localisation of verbal and paralinguistic events, their association and fusion. The system is designed as a flexible component to be used in conjunction with an orchestrated video conferencing system to improve the overall experience of interaction between spatially separated families and friends. Reduced latency levels achieved to date have shown improved responsiveness of the system.
