2006 IEEE Odyssey - The Speaker and Language Recognition Workshop 2006
DOI: 10.1109/odyssey.2006.248125

Speaker Segmentation and Clustering using Gender Information

Abstract: Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing this collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden to Department of Defense, Washington Headquarters Services, Directorate for Infor…

Cited by 2 publications (5 citation statements)
References 13 publications
“…The training conditions all involved four-wire (two-channel) conversations and were defined by the following amounts of data: (1) an excerpt estimated to contain approximately 10 seconds of speech of the target on its designated side (designated as 10sec4w), or (2) one five-minute conversation (designated as 1conv4w). The AFRL/IEC system submitted for the conditions requiring speaker segmentation and clustering is described in [2]. The GMM-based systems, regardless of feature set, all used Version 2.1 of the MIT Lincoln Laboratory (MIT-LL) MFCC/GMM system [5] with 2048 mixtures per model and diagonal covariance matrices for each mixture.…”
mentioning
confidence: 99%
“…the target on its designated side (designated as 10sec4w). The GMM-based systems, regardless of feature set, all used Version 2.1 of the MIT Lincoln Laboratory (MIT-LL) MFCC/GMM system [5] with 2048 mixtures per model and diagonal covariance matrices for each mixture. In addition to the speech files, NIST provided transcripts produced by an English-language speech recognition system from BBN with word error rates typically in the range of 15–30% for English conversational telephone speech. All of the GMM-based systems used a common speech activity detector (SAD), which worked in three stages. The first stage utilized a two-state speech/non-speech Hidden Markov Model (HMM) with MFCCs as…”
mentioning
confidence: 99%
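The excerpt above describes a classic GMM speaker-modeling recipe: MFCC feature vectors modeled by a Gaussian mixture with diagonal covariance matrices, trained per speaker. The following is a minimal illustrative sketch of fitting such a model with EM; it is not the MIT-LL code, and where the cited system used 2048 mixtures on real MFCCs, this uses a handful of components on stand-in random features:

```python
import numpy as np

def fit_diag_gmm(x, k, iters=50, seed=0):
    """EM for a Gaussian mixture with diagonal covariances.
    x: (n_frames, dim) feature matrix; returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.full(k, 1.0 / k)                      # mixture weights
    mu = x[rng.choice(n, k, replace=False)]      # means initialized from data
    var = np.tile(x.var(axis=0), (k, 1)) + 1e-3  # diagonal variances
    for _ in range(iters):
        # E-step: log density of each frame under each component, plus log weight
        logp = (-0.5 * (np.log(2 * np.pi * var).sum(axis=1)
                        + (((x[:, None, :] - mu) ** 2) / var).sum(axis=2))
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)  # stabilize before exponentiating
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)        # responsibilities
        # M-step: re-estimate weights, means, and diagonal variances
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ x) / nk[:, None]
        var = (r.T @ (x ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

# Tiny demo on synthetic two-cluster "features" standing in for MFCC frames.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(-3, 1, (200, 4)), rng.normal(3, 1, (200, 4))])
weights, means, variances = fit_diag_gmm(feats, k=2)
print(weights.round(3), means.shape)
```

In a verification system of the kind quoted, per-frame log-likelihoods under a speaker's GMM would then be scored against a background model; that scoring step is omitted here.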
“…Application examples of the above include telephony mixed-channel speaker verification [Ore et al., 2006; Deng et al., 2006] and acoustic model adaptation for speech recognition [Pusateri & Hazen, 2002; Hain et al., 2006; Janin et al., 2006]. In the mixed-channel speaker verification task, speaker-specific models after training are used to perform verification against a designated reference model.…”
Section: Speaker Clustering
mentioning
confidence: 99%
“…Segmentation is performed at these locations, and further speaker clustering can then be done to determine the identity of the speakers present. This is the strategy used in papers such as [Siu et al., 1992; Wegmann et al., 1999b; Kemp et al., 2000; Ore et al., 2006]. In [Wegmann et al., 1999b], an amplitude-based silence detector is used as a first pass to break up continuous broadcast news recordings into segments.…”
Section: Segmentation Using Silence
mentioning
confidence: 99%
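The first-pass strategy described in this excerpt, an amplitude (energy) silence detector that splits a long recording into segments at sufficiently long pauses, can be sketched as follows. This is an illustrative sketch, not the detector from [Wegmann et al., 1999b]; the frame size, dB threshold, and minimum-silence duration are assumed parameters:

```python
import numpy as np

def split_on_silence(samples, rate, frame_ms=20, thresh_db=-35, min_sil_ms=300):
    """Amplitude-based first-pass segmentation (illustrative sketch).
    Returns (start, end) sample indices of non-silent segments."""
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame
    x = np.asarray(samples[: n * frame], dtype=float).reshape(n, frame)
    # Per-frame RMS energy in dB relative to the loudest frame.
    rms = np.sqrt((x ** 2).mean(axis=1)) + 1e-12
    db = 20 * np.log10(rms / rms.max())
    speech = db > thresh_db
    # Cut only at silences of at least min_sil_ms, so short pauses
    # inside an utterance do not split a segment.
    min_sil = max(1, min_sil_ms // frame_ms)
    segments, start, sil = [], None, 0
    for i, s in enumerate(speech):
        if s:
            if start is None:
                start = i
            sil = 0
        elif start is not None:
            sil += 1
            if sil >= min_sil:
                segments.append((start * frame, (i - sil + 1) * frame))
                start, sil = None, 0
    if start is not None:
        segments.append((start * frame, n * frame))
    return segments

# Demo: 2 s of tone, 1 s of silence, 2 s of tone at a 1 kHz sample rate.
t = np.arange(2000)
tone = 0.5 * np.sin(2 * np.pi * 100 * t / 1000)
samples = np.concatenate([tone, np.zeros(1000), tone])
print(split_on_silence(samples, rate=1000))  # two segments split at the pause
```

In the cited pipeline, the segments produced by this kind of detector would then be handed to a clustering stage to assign speaker identities.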