An Information Theoretic Combination of MFCC and TDOA Features for Speaker Diarization

Vijayasenan, Deepu; Valente, Fabio; Bourlard, Hervé

doi:10.1109/tasl.2010.2048603

Cited by 31 publications

(25 citation statements)

References 14 publications

“…As previously pointed in [11], the IB system appears more robust to variation of weights across different meeting recordings. This robustness holds also in case of all-pairs TDOA feature vectors.…”

Section: Methods (supporting)

confidence: 65%

“…Whenever multiple features are available, the combination is performed in the space of relevance variables Y [11]. Separate GMMs with the same number of components are trained for each feature stream.…”

Section: Information Bottleneck Diarization (mentioning)

confidence: 99%
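The combination described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes diagonal-covariance GMMs, treats each mixture component as one relevance variable y, and merges the per-stream posteriors p(y|x) with a weighted sum. The function names, the random model parameters, and the 0.7/0.3 stream weights are all hypothetical.

```python
import numpy as np

def component_posteriors(x, weights, means, variances):
    """Return p(y|x) per frame for a diagonal-covariance GMM (y = component index)."""
    diff = x[:, None, :] - means[None, :, :]                     # (T, K, D)
    log_lik = -0.5 * np.sum(diff**2 / variances
                            + np.log(2 * np.pi * variances), axis=2)
    log_post = np.log(weights) + log_lik                         # (T, K)
    log_post -= log_post.max(axis=1, keepdims=True)              # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def combine_streams(posteriors, stream_weights):
    """Weighted sum of per-stream p(y|x); weights are normalised to sum to 1."""
    w = np.asarray(stream_weights, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, posteriors))

# Toy data: 6 frames, 4 components per stream (same count for both streams).
rng = np.random.default_rng(0)
T, K = 6, 4
streams = {"mfcc": 19, "tdoa": 7}          # hypothetical feature dimensions
posts = []
for dim in streams.values():
    x = rng.standard_normal((T, dim))
    gmm_weights = np.full(K, 1.0 / K)
    gmm_means = rng.standard_normal((K, dim))
    gmm_vars = np.ones((K, dim))
    posts.append(component_posteriors(x, gmm_weights, gmm_means, gmm_vars))

p_combined = combine_streams(posts, stream_weights=[0.7, 0.3])  # assumed weights
print(p_combined.sum(axis=1))
```

Because both GMMs have the same number of components, the two posterior matrices have the same shape and each row of the combined matrix remains a probability distribution over Y.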

“…After clustering, the speaker boundaries are realigned. Instead of using HMM/GMMs, the realignment is performed in the space of relevance variables p(y|x) using a Kullback-Leibler divergence based HMM system described in [11].…”

Section: Information Bottleneck Diarization (mentioning)

confidence: 99%
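A rough sketch of realignment in the space of p(y|x) is given below. It is an assumption-laden illustration rather than the system described in [11]: the emission cost is taken to be KL(p(y|x_t) || p(y|c)), and a fixed switch penalty stands in for whatever transition or minimum-duration constraint the real HMM uses. The names `kl_realign` and `switch_penalty` are hypothetical.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) along the last axis, with clipping to avoid log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.sum(p * np.log(p / q), axis=-1)

def kl_realign(frame_post, cluster_post, switch_penalty=2.0):
    """Viterbi decoding with KL emission costs and a constant speaker-change penalty."""
    T, C = frame_post.shape[0], cluster_post.shape[0]
    # cost[t, c] = KL(p(y|x_t) || p(y|c))  (direction is an assumption)
    cost = kl_divergence(frame_post[:, None, :], cluster_post[None, :, :])
    trans = switch_penalty * (1.0 - np.eye(C))   # zero cost for staying
    best = cost[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        total = best[:, None] + trans            # total[i, j]: arrive in j from i
        back[t] = np.argmin(total, axis=0)
        best = total[back[t], np.arange(C)] + cost[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmin(best))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Two speaker clusters over two relevance variables, plus one noisy frame.
cluster_post = np.array([[0.9, 0.1], [0.1, 0.9]])
frame_post = np.array([[0.9, 0.1]] * 3 + [[0.4, 0.6]]
                      + [[0.9, 0.1]] * 2 + [[0.1, 0.9]] * 4)
path = kl_realign(frame_post, cluster_post)
print(path)
```

In this toy case the penalty keeps the single noisy frame with its neighbours while still allowing the genuine speaker change later in the sequence.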

Speaker diarization of meetings based on large TDOA feature vectors

Vijayasenan

Valente

2012

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

This paper investigates the use of large TDOA feature vectors together with acoustic information in speaker diarization of meetings. TDOAs are obtained by considering all possible microphone pairs, and this approach is compared with conventional TDOA features extracted w.r.t. a reference channel. The study is carried out using two systems, the first based on Gaussian Mixture Modeling and the second based on the Information Bottleneck approach. Results on the NIST RT06/RT07/RT09 evaluation datasets show a large speaker error reduction of 30% relative, going from 14.3% to 10.8% for the first and from 12.3% to 8.2% for the second, whenever the feature weighting is properly handled. Furthermore, results reveal that the IB system is more robust to different numbers of microphones, even when large all-pairs TDOA vectors are used, thus outperforming the HMM/GMM system by 25% relative (8.2% error compared to 10.8%).
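The arithmetic behind the abstract can be checked with a short sketch. It assumes that a reference-channel setup yields one TDOA per non-reference microphone and that the all-pairs setup yields one per unordered pair; the microphone count of 8 is a hypothetical example, and the helper names are illustrative.

```python
from itertools import combinations

def tdoa_dims(num_mics):
    """TDOA vector size for reference-channel vs. all-pairs extraction."""
    reference = num_mics - 1                                  # one delay per other mic
    all_pairs = len(list(combinations(range(num_mics), 2)))   # M * (M - 1) / 2
    return reference, all_pairs

def relative_reduction(before, after):
    """Relative error reduction in percent."""
    return 100.0 * (before - after) / before

ref_dim, pair_dim = tdoa_dims(8)                  # hypothetical 8-microphone array
gmm_gain = relative_reduction(14.3, 10.8)         # HMM/GMM system
ib_gain = relative_reduction(12.3, 8.2)           # IB system
ib_vs_gmm = relative_reduction(10.8, 8.2)         # IB relative to HMM/GMM
print(ref_dim, pair_dim)
print(round(gmm_gain, 1), round(ib_gain, 1), round(ib_vs_gmm, 1))
```

Computed from the quoted error rates, the per-system reductions come out at roughly 24% and 33% relative, which brackets the rounded 30% figure, and the IB-versus-HMM/GMM gap is about 24%, in line with the stated 25%.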

“…However, most of the existing methods assume that the microphone location is given to estimate the direction of arrival of speakers [3]- [6]. Some methods using Time Difference Of Arrival (TDOA) have been proposed [7]- [9], which do not assume the known microphone location. These methods propose using HMM for speaker segmentation and clustering, as well as hierarchical agglomerative clustering using spacial information.…”

Section: Introduction (mentioning)

confidence: 99%

“…These methods propose using HMM for speaker segmentation and clustering, as well as hierarchical agglomerative clustering using spacial information. However, the methods have difficulty with overlapping speech [7] and estimating the number of speakers deterministically [8], [9].…”

Section: Introduction (mentioning)

confidence: 99%

Blind spatial sound source clustering and activity detection using uncalibrated microphone array

Nakamura

Mizumoto

2017

2017 25th European Signal Processing Conference (EUSIPCO)

This paper presents a method for estimating the number, as well as the activity periods, of spatially distributed sound sources using an uncalibrated microphone array. This methodology is applied for the purposes of speaker diarization. In general, speaker diarization has difficulty with: 1) estimating the number of sound sources (speakers), and 2) activity detection of multiple sound sources including overlap of utterances. Several microphone array based techniques have already tackled these challenges. However, existing methods mainly assume that the steering vectors for the microphone array are calibrated in advance to identify sound sources, which is difficult to satisfy when ad-hoc or flexible microphone arrays are used. Thus our approach estimates the number of sound sources blindly in two steps. First, Time Delay of Arrival (TDOA) of the observed signal is clustered. Second, the sound source activity is detected by clustering the long-term spatial spectrum using the TDOA based steering vector for each cluster. The validity of the algorithm is confirmed by both synthesized signals and a real-world flexible microphone array application.
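The abstract does not say how the TDOA values are estimated before clustering. One common choice, used here purely as an assumed stand-in, is GCC-PHAT between a pair of channels. The sketch below estimates a single delay from synthetic signals; the function name, the 16 kHz sampling rate, and the 5-sample delay are hypothetical.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` in seconds using GCC-PHAT."""
    n = len(sig) + len(ref)                       # zero-pad for linear correlation
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                        # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so that negative lags come first, then non-negative lags.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

# Synthetic pair: the second channel is the first one delayed by 5 samples.
fs = 16000                                        # assumed sampling rate
rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)
ref = np.concatenate((noise, np.zeros(5)))
sig = np.concatenate((np.zeros(5), noise))
tau = gcc_phat_delay(sig, ref, fs)
print(tau * fs)
```

Delays estimated this way for each frame could then be fed to a clustering step such as the one the abstract describes.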

A hybrid HXPLS‐TMFCC parameterization and DCNN‐SFO clustering based speaker diarization system

Sailaja

Maloji

Mannepalli

2022

Concurrency and Computation

Speaker diarization is the process of segmenting a speech signal and grouping the segments into homogeneous regions by speaker identity. The central idea is to distinguish between speakers and attach a label to each speaker's signal. As mass communication and meetings grow quickly, speaker diarization is needed to improve the readability of speech transcripts. To address this, the proposed approach extracts features using the tangent weighted mel‐frequency cepstral coefficient (TMFCC) and the extended linear prediction with autocorrelation snapshot, and uses a deep convolutional neural network (DCNN) for clustering, optimized with the sailfish optimizer. The new element in the HXLPS extraction method is holoentropy combined with extended linear prediction with autocorrelation snapshot. TMFCC improves the efficiency and effectiveness of the proposed scheme by using lower-energy and higher-energy frames. A voice activity detection method then separates speech from non‐speech signals, and every segmented signal is represented by a d‐vector. The speaker signals are clustered according to the speaker labels used in the DCNN. Evaluation measures such as tracking distance, false alarm rate, and diarization error rate are used to examine the effectiveness.
