2012
DOI: 10.1109/tpami.2011.47

Multimodal Speaker Diarization

Abstract: We present a novel probabilistic framework that fuses information coming from the audio and video modality to perform speaker diarization. The proposed framework is a Dynamic Bayesian Network (DBN) that is an extension of a factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audiovisual space. The framework is very robust to different contexts, makes no assumptions…
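The abstract describes the model as a factorial-HMM-style DBN in which each person is a separate hidden chain and the chains jointly explain audio and video observations. The sketch below is only a rough illustration of that idea, not the paper's model: it runs exact forward inference over two assumed binary speaking/silent chains, with invented transition probabilities and a toy observation likelihood in which each modality scores the number of active speakers.

```python
# Minimal sketch of exact inference in a factorial HMM by collapsing the
# per-speaker chains into one HMM over the joint (product) state space.
# All numbers below are illustrative assumptions, not values from the paper.
import itertools
import numpy as np

n_speakers = 2                      # one binary chain per person: 0 = silent, 1 = speaking
A = np.array([[0.9, 0.1],           # assumed per-chain transition matrix
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])           # assumed initial distribution per chain

# Joint states are tuples like (0, 1): speaker 0 silent, speaker 1 speaking.
joint_states = list(itertools.product([0, 1], repeat=n_speakers))

def joint_transition(prev, nxt):
    """Chains evolve independently, so the joint transition factorizes."""
    return np.prod([A[p, q] for p, q in zip(prev, nxt)])

def joint_likelihood(obs_audio, obs_video, state):
    """Toy observation model: each modality scores the number of active speakers."""
    speaking = sum(state)
    p_audio = np.exp(-0.5 * (obs_audio - speaking) ** 2)
    p_video = np.exp(-0.5 * (obs_video - speaking) ** 2)
    return p_audio * p_video        # modalities fused via conditional independence

def forward(audio, video):
    """Forward algorithm over the collapsed joint-state HMM."""
    alpha = np.array([np.prod([pi[s] for s in st]) *
                      joint_likelihood(audio[0], video[0], st)
                      for st in joint_states])
    for t in range(1, len(audio)):
        alpha = np.array([
            sum(alpha[i] * joint_transition(joint_states[i], st)
                for i in range(len(joint_states))) *
            joint_likelihood(audio[t], video[t], st)
            for st in joint_states])
        alpha /= alpha.sum()        # normalize to avoid underflow
    return dict(zip(joint_states, alpha))

print(forward(audio=[1.0, 1.1, 0.2], video=[0.9, 1.0, 0.1]))
```

Collapsing the chains into the product state space keeps the example short but is only tractable for a handful of speakers, which is why factorial models are usually paired with approximate inference in practice.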


Cited by 60 publications (60 citation statements)
References 16 publications
“…The work presented in [106] integrates audiovisual features for on-line audiovisual speaker diarization using a dynamic Bayesian network (DBN), but tests were limited to discussions with two to three people in two short test scenarios. Another use of DBNs, the factorial HMM [107], is proposed in [108] as an audiovisual framework. The factorial HMM arises by forming a dynamic Bayesian belief network composed of several layers.…”
Section: Overlap Detection (citation type: mentioning)
confidence: 99%
“…Speaker diarization seeks to answer the question of "who spoke when," often by clustering detected speech and mapping clusters to names [1]. Recently, [13] explored multimodal speaker diarization using a Dynamic Bayesian Network on both business meeting and broadcast news videos. Several works extending from [5] have tried to tackle a similar problem using multimodal information for television shows, but they rely on the a priori presence of fully annotated transcripts that have names mapped to spoken text.…”
Section: Who Said What (citation type: mentioning)
confidence: 99%
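The statement above frames diarization as clustering detected speech segments and then mapping clusters to speaker names. A minimal sketch of the clustering step, using made-up two-dimensional segment embeddings and scikit-learn's agglomerative clustering rather than any cited system's actual features or pipeline:

```python
# Minimal sketch of clustering-based "who spoke when": per-segment embeddings
# are grouped by agglomerative clustering and each cluster becomes one speaker.
# The embeddings and segment times below are made-up placeholders.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# One fixed-length embedding (e.g., averaged acoustic features) per detected
# speech segment; in practice these would come from a front-end extractor.
segment_embeddings = np.array([
    [0.9, 0.1], [0.8, 0.2],   # segments that sound like speaker A
    [0.1, 0.9], [0.2, 0.8],   # segments that sound like speaker B
    [0.85, 0.15],             # speaker A again
])
segment_times = [(0.0, 2.1), (2.1, 4.0), (4.0, 6.5), (6.5, 8.0), (8.0, 9.3)]

clustering = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = clustering.fit_predict(segment_embeddings)

for (start, end), label in zip(segment_times, labels):
    print(f"{start:5.1f}-{end:5.1f}s  speaker_{label}")
```

Mapping the resulting anonymous cluster labels to real names is the additional step that the cited television-show works address with annotated transcripts.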
“…As the retrieval of information about people in videos is of high interest to users, research efforts have been devoted to the unsupervised segmentation of videos into homogeneous segments according to person identity, such as speaker diarization [21,17,29], face diarization [5,35], and audio-visual (AV) person diarization [10,25,16,8]. Combined with names extracted from overlaid text, AV person diarization makes it possible to identify people in videos [9].…”
Section: Introduction (citation type: mentioning)
confidence: 99%