Speaker Diarization: A Review of Recent Research

Anguera, Xavier; Bozonnet, Simon; Evans, Nicholas; Fredouille, Corinne; Friedland, Gerald; Vinyals, Oriol

doi:10.1109/tasl.2011.2125954

Cited by 563 publications

(379 citation statements)

References 87 publications

Supporting

Mentioning: 375

Contrasting

Unclassified


“…This error is referred to as speech time error in the results computed by NIST tools 1 . Others [13] choose to report the FA speaker and Miss speaker errors inclusive of overlap, e.g. a segment which contains two speakers that has been completely missed by the system will have twice the error.…”

Section: Diarization Error Rate (mentioning)

confidence: 99%

Where are the challenges in speaker diarization?

Sinclair

King

2013

2013 IEEE International Conference on Acoustics, Speech and Signal Processing


We present a study on the contributions to Diarization Error Rate by the various components of a speaker diarization system. Following on from an earlier study by Huijbregts and Wooters, we extend into more areas and draw somewhat different conclusions. From a series of experiments combining real, oracle and ideal system components, we are able to conclude that the primary cause of error in diarization is the training of speaker models on impure data, something that is in fact done in every current system. We conclude by suggesting ways to improve future systems, including a focus on training the speaker models from smaller quantities of pure data instead of all the data, as is currently done.
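The overlap-inclusive scoring described in the citation statement for this entry can be illustrated with a small frame-level sketch. It is not the NIST scoring tool and not the citing authors' code: the function names `frame_errors` and `der` are hypothetical, each frame is represented as a set of active speaker labels, and the system labels are assumed to have already been mapped to the reference labels (real scorers search for the best mapping and usually apply a forgiveness collar).

```python
def frame_errors(ref, hyp):
    """Count overlap-inclusive errors over aligned frames.

    ref, hyp: lists of sets of speaker labels, one set per frame.
    Assumption: hyp labels are already mapped to ref labels.
    Returns (miss, false_alarm, confusion, total_reference_speaker_frames).
    """
    miss = false_alarm = confusion = total = 0
    for r, h in zip(ref, hyp):
        n_ref, n_hyp = len(r), len(h)
        n_correct = len(r & h)                   # speakers found with the right label
        total += n_ref                           # each reference speaker counts once
        miss += max(0, n_ref - n_hyp)            # reference speakers with no output
        false_alarm += max(0, n_hyp - n_ref)     # output speakers with no reference
        confusion += min(n_ref, n_hyp) - n_correct
    return miss, false_alarm, confusion, total


def der(ref, hyp):
    """Diarization error rate as (miss + false alarm + confusion) / total."""
    miss, false_alarm, confusion, total = frame_errors(ref, hyp)
    if total == 0:
        raise ValueError("reference contains no speech")
    return (miss + false_alarm + confusion) / total


# Frame 2 holds two reference speakers and no system output,
# so it contributes two missed speaker-frames rather than one.
ref = [{"A"}, {"A", "B"}, {"A", "B"}, set()]
hyp = [{"A"}, set(), {"A"}, {"B"}]
print(frame_errors(ref, hyp))  # (3, 1, 0, 5)
print(der(ref, hyp))           # 0.8
```

Under this counting, a fully missed two-speaker frame costs twice as much as a missed single-speaker frame, which is the behaviour the citation statement attributes to overlap-inclusive reporting.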


“…The task of speaker diarisation is an important prerequisite task for audio indexing, automatic speech recognition (ASR) and more [1,2]. The objective is to split the audio into segments which are associated with a single speaker, and to identify among the set of segments those that are spoken by the same speaker.…”

Section: Introduction (mentioning)

confidence: 99%

DNN approach to speaker diarisation using speaker channels

Milner

Hain

2017

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Speaker diarisation addresses the question of "who speaks when" in audio recordings, and has been studied extensively in the context of tasks such as broadcast news, meetings, etc. Performing diarisation on individual headset microphone (IHM) channels is sometimes assumed to easily give the desired output of speaker labelled segments with timing information. However, it is shown that given imperfect data, such as speaker channels with heavy crosstalk and overlapping speech, this is not the case. Deep neural networks (DNNs) can be trained on features derived from the concatenation of speaker channel features to detect which is the correct channel for each frame. Crosstalk features can be calculated and DNNs trained with or without overlapping speech to combat problematic data. A simple frame decision metric of counting occurrences is investigated as well as adding a bias against selecting nonspeech for a frame. Finally, two different scoring setups are applied to both datasets. The stricter SHEF setup finds diarisation error rates (DER) of 9.2% on TBL and 23.2% on RT07 while the NIST setup achieves 5.7% and 15.1% respectively.


“…Speaker diarization (the process of partitioning an input audio stream into homogeneous segments according to speaker identity), when used together with speaker recognition (the identification of speakers by their voices), has become an important key technology for tasks such as navigation, retrieval, and high-level inference from audio data in meeting recordings. Some speaker diarization systems integrate motion and gazing data analyses with audio data analysis to achieve higher accuracy and robustness (Anguera et al 2012; Moattar and Homayounpour 2012). There are also meeting systems that use multimodal data including both motion and gaze (Hain et al 2010; Tur et al 2008).…”

Section: Introduction (mentioning)

confidence: 99%

Multimodal corpus of multiparty conversations in L1 and L2 languages and findings obtained from it

Yamamoto

Taguchi

Ijuin

et al.

2015

Lang Resources & Evaluation


To investigate the differences in communicative activities by the same interlocutors in Japanese (their L1) and in English (their L2), an 8-h multimodal corpus of multiparty conversations was collected. Three subjects participated in each conversational group, and they had conversations on free-flowing and goal-oriented topics in Japanese and in English. Their utterances, eye gazes, and gestures were recorded with microphones, eye trackers, and video cameras. The utterances and eye gazes were manually annotated. Their utterances were transcribed, and the transcriptions of each participant were aligned with those of the others along the time axis. Quantitative analyses were made to compare the communicative activities caused by the differences in conversational languages, the conversation types, and the levels of language expertise in L2. The results reveal different utterance characteristics and gaze patterns that reflect the differences in difficulty felt by the participants in each conversational condition. Both total and average durations of utterances were shorter in their L2 than in their L1 conversations. Differences in eye gazes were mainly found in those toward the information senders: Speakers were gazed at more in their second-language than in their native-language conversations. Our findings on the characteristics of conversations in the second language suggest possible directions for future research in psychology, cognitive science, and human-computer interaction technologies.
