2012
DOI: 10.1109/tpami.2011.47

Multimodal Speaker Diarization

Abstract: We present a novel probabilistic framework that fuses information coming from the audio and video modality to perform speaker diarization. The proposed framework is a Dynamic Bayesian Network (DBN) that is an extension of a factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audiovisual space. The framework is very robust to different contexts, makes no assumptions…
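The abstract describes the model as a factorial-HMM-style DBN in which each person is a separate hidden chain and the chains jointly explain audio and video observations. The sketch below is only a rough illustration of that idea, not the paper's model: it runs exact forward inference over two assumed binary speaking/silent chains, with invented transition probabilities and a toy observation likelihood in which each modality scores the number of active speakers.

```python
# Minimal sketch of exact inference in a factorial HMM by collapsing the
# per-speaker chains into one HMM over the joint (product) state space.
# All numbers below are illustrative assumptions, not values from the paper.
import itertools
import numpy as np

n_speakers = 2                      # one binary chain per person: 0 = silent, 1 = speaking
A = np.array([[0.9, 0.1],           # assumed per-chain transition matrix
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])           # assumed initial distribution per chain

# Joint states are tuples like (0, 1): speaker 0 silent, speaker 1 speaking.
joint_states = list(itertools.product([0, 1], repeat=n_speakers))

def joint_transition(prev, nxt):
    """Chains evolve independently, so the joint transition factorizes."""
    return np.prod([A[p, q] for p, q in zip(prev, nxt)])

def joint_likelihood(obs_audio, obs_video, state):
    """Toy observation model: each modality scores the number of active speakers."""
    speaking = sum(state)
    p_audio = np.exp(-0.5 * (obs_audio - speaking) ** 2)
    p_video = np.exp(-0.5 * (obs_video - speaking) ** 2)
    return p_audio * p_video        # modalities fused via conditional independence

def forward(audio, video):
    """Forward algorithm over the collapsed joint-state HMM."""
    alpha = np.array([np.prod([pi[s] for s in st]) *
                      joint_likelihood(audio[0], video[0], st)
                      for st in joint_states])
    for t in range(1, len(audio)):
        alpha = np.array([
            sum(alpha[i] * joint_transition(joint_states[i], st)
                for i in range(len(joint_states))) *
            joint_likelihood(audio[t], video[t], st)
            for st in joint_states])
        alpha /= alpha.sum()        # normalize to avoid underflow
    return dict(zip(joint_states, alpha))

print(forward(audio=[1.0, 1.1, 0.2], video=[0.9, 1.0, 0.1]))
```

Collapsing the chains into the product state space keeps the example short but is only tractable for a handful of speakers, which is why factorial models are usually paired with approximate inference in practice.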


Cited by 60 publications (60 citation statements)
References 16 publications
“…The work presented in [106] integrates audiovisual features for on-line audiovisual speaker diarization using a dynamic Bayesian network (DBN), but tests were limited to discussions with two to three people in two short test scenarios. Another use of DBNs, the factorial HMM [107], is proposed in [108] as an audiovisual framework. The factorial HMM arises by forming a dynamic Bayesian belief network composed of several layers.…”
Section: Overlap Detection (citation type: mentioning)
confidence: 99%
“…Speaker diarization seeks to answer the question of "who spoke when," often by clustering detected speech and mapping clusters to names [1]. Recently, [13] explored multimodal speaker diarization using a Dynamic Bayesian Network on both business meeting and broadcast news videos. Several works extending from [5] have tried to tackle a similar problem using multimodal information for television shows, but they rely on the a priori presence of fully annotated transcripts that have names mapped to spoken text.…”
Section: Who Said What (citation type: mentioning)
confidence: 99%
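The statement above frames diarization as clustering detected speech segments and then mapping clusters to speaker names. A minimal sketch of the clustering step, using made-up two-dimensional segment embeddings and scikit-learn's agglomerative clustering rather than any cited system's actual features or pipeline:

```python
# Minimal sketch of clustering-based "who spoke when": per-segment embeddings
# are grouped by agglomerative clustering and each cluster becomes one speaker.
# The embeddings and segment times below are made-up placeholders.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# One fixed-length embedding (e.g., averaged acoustic features) per detected
# speech segment; in practice these would come from a front-end extractor.
segment_embeddings = np.array([
    [0.9, 0.1], [0.8, 0.2],   # segments that sound like speaker A
    [0.1, 0.9], [0.2, 0.8],   # segments that sound like speaker B
    [0.85, 0.15],             # speaker A again
])
segment_times = [(0.0, 2.1), (2.1, 4.0), (4.0, 6.5), (6.5, 8.0), (8.0, 9.3)]

clustering = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = clustering.fit_predict(segment_embeddings)

for (start, end), label in zip(segment_times, labels):
    print(f"{start:5.1f}-{end:5.1f}s  speaker_{label}")
```

Mapping the resulting anonymous cluster labels to real names is the additional step that the cited television-show works address with annotated transcripts.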
“…As the retrieval of information about people in videos is of high interest to users, research efforts have been devoted to the unsupervised segmentation of videos into homogeneous segments according to person identity, such as speaker diarization [21,17,29], face diarization [5,35], and audio-visual (AV) person diarization [10,25,16,8]. Combined with names extracted from overlaid text, AV person diarization makes it possible to identify people in videos [9].…”
Section: Introduction (citation type: mentioning)
confidence: 99%