Overlapped speech detection for improved speaker diarization in multiparty meetings

562

349

Abstract: Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper we review the current state of the art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.

Section: Overlap Detection (mentioning)

confidence: 99%

Speaker Diarization: A Review of Recent Research

Anguera

Bozonnet

Evans

et al. 2012

“…Although we feel that our approach is promising, we clearly need to perform more research to improve our overlap detection system. Note that a number of other research institutes currently also investigate the overlapping speech problem [12], [16].…”

Section: A Top-down Analysis (mentioning)

confidence: 99%

Speaker Diarization Error Analysis Using Oracle Components

Huijbregts

Leeuwen

Wooters

2012

Abstract: In this paper we describe an analysis of our speaker diarization system based on a series of oracle experiments. In this analysis, each system component is substituted by an oracle component that uses the reference transcripts to perform flawlessly. By placing the original components back into the system one at a time, either in a top-down or bottom-up manner, the performance of each individual system component is measured. The analysis approach can be applied to any speaker diarization system that consists of a concatenation of separate components. Our experimental findings are relevant for most RT09s diarization systems, which all apply similar techniques. The analysis revealed that three components caused most errors: speech activity detection, the inability to handle overlapping speech, and robustness of the merging component to cluster impurity.
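
The substitution procedure described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the component names, the `toy_der` scorer, and its error values are hypothetical placeholders standing in for a real diarization pipeline and scoring tool.

```python
from typing import Callable, Dict, List, Set


def top_down_oracle_analysis(
    components: List[str],
    evaluate: Callable[[Set[str]], float],
) -> Dict[str, float]:
    """Attribute error to each component by restoring it in pipeline order.

    `evaluate` receives the set of components currently replaced by oracles
    and returns the resulting error (for example, DER in percent).
    """
    oracle = set(components)        # start with every component as an oracle
    previous = evaluate(oracle)
    contribution: Dict[str, float] = {}
    for name in components:         # put the real components back one at a time
        oracle.discard(name)
        current = evaluate(oracle)
        contribution[name] = current - previous
        previous = current
    return contribution


# Hypothetical toy scorer: each real (non-oracle) component adds a fixed error.
TOY_ERROR = {"sad": 3.0, "overlap": 4.0, "merging": 2.0}


def toy_der(oracle: Set[str]) -> float:
    return sum(err for name, err in TOY_ERROR.items() if name not in oracle)


result = top_down_oracle_analysis(["sad", "overlap", "merging"], toy_der)
```

A bottom-up variant would iterate over the components in reverse order. In a real system the errors are unlikely to be additive as in this toy scorer, which is why the order of restoration matters.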

“…This is due to the high degree of overlapping speech in this dataset (13.6% for RT'09 cf. 7.6% for RT'07) which is well known to have a significant impact on the performance of state-of-the-art speaker diarization systems [41]. Speaker diarization performance using a top-down approach is illustrated on row 5 of Table II.…”

Section: Diarization Performance (mentioning)

confidence: 99%

A Comparative Study of Bottom-Up and Top-Down Approaches to Speaker Diarization

Evans

Bozonnet

Wang

et al. 2012

Abstract: This paper presents a theoretical framework to analyze the relative merits of the two most general, dominant approaches to speaker diarization involving bottom-up and top-down hierarchical clustering. We present an original qualitative comparison which argues how the two approaches are likely to exhibit different behavior in speaker inventory optimization and model training: bottom-up approaches will capture comparatively purer models and will thus be more sensitive to nuisance variation such as that related to the speech content; top-down approaches, in contrast, will produce less discriminative speaker models but, importantly, models which are potentially better normalized against nuisance variation. We report experiments conducted on two standard, single-channel NIST RT evaluation datasets which validate our hypotheses. Results show that competitive performance can be achieved with both bottom-up and top-down approaches (average DERs of 21% and 22%), and that neither approach is superior. Speaker purification, which aims to improve speaker discrimination, gives more consistent improvements with the top-down system than with the bottom-up system (average DERs of 19% and 25%), thereby confirming that the top-down system is less discriminative and that the bottom-up system is less stable. Finally, we report a new combination strategy that exploits the merits of the two approaches. Combination delivers an average DER of 17% and confirms the intrinsic complementarity of the two approaches.
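
The DER figures in this abstract are diarization error rates. Under the usual NIST-style definition, DER is the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speaker time. A minimal sketch under that assumption follows; the function name and the example durations are illustrative and are not taken from the paper.

```python
def diarization_error_rate(
    missed: float,
    false_alarm: float,
    confusion: float,
    total_speech: float,
) -> float:
    """Return DER in percent from error durations given in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech


# Made-up durations (seconds): 25 s missed, 15 s false alarm, 60 s confusion,
# scored against 500 s of reference speaker time.
der = diarization_error_rate(25.0, 15.0, 60.0, 500.0)
```

Because overlapped regions count once per active reference speaker in the denominator, a system that assigns only one speaker per segment accumulates missed speech wherever speakers overlap.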