Speaker diarization of French broadcast news

Gupta, Vishwa; Kenny, Patrick; Ouellet, Pierre; Dumouchel, Pierre

doi:10.1109/icassp.2008.4518622

Cited by 19 publications

(13 citation statements)

References 6 publications

“…Finally, the scaling factor r was set equal to 0.30 and 0.17 for the DGA and ELDA set, respectively. We compare the results of the proposed algorithm to those obtained by CRIM's primary system, described in [14]. Like the proposed method, the system uses the AHC algorithm to merge segments.…”

Section: Results (mentioning)

confidence: 99%

Compensation for inter-frame correlations in speaker diarization and recognition

Stafylakis, Kenny, Gupta, et al. 2013

2013 IEEE International Conference on Acoustics, Speech and Signal Processing

Self-citation

In this paper, we introduce the concept of the effective sample size to speaker diarization and recognition. We show why the use of the nominal sample size is inadequate to feature streams that exhibit inter-frame correlations and how it adversely affects inference. We then discuss the effective sample size, that is the sample size of a set of independent observations that carry the equivalent amount of statistical information about the model parameters and how the scaling factor can be estimated. Our experiments on speaker diarization show that once the effective sample size is adopted, state-of-the-art results can be attained even with single Gaussians and Hierarchical Clustering, and even when the scaling factor is set to be common for all utterances. On speaker recognition, encouraging results are reported on NIST-2010 using iVectors and PLDA.

“…In full batch processing, we normalize the features of each speaker in a room to zero mean and compute a 100-dimensional i-vector from this speaker in the room. In order to assign utterances in a room to speakers, we carry out speaker diarization using a modified version of the multi-stage segmentation and clustering system [42] as described before.…”

Section: Results Obtained With Full Batch Processing (mentioning)

confidence: 99%

“…[figure caption: Architecture of seven-layer DNN used with TRAP and i-vector features] version of the multi-stage segmentation and clustering system [42]. The modification is that each utterance corresponds to one speaker.…”

Section: Algorithm Used For Decoding (mentioning)

confidence: 99%

Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation

Alam, Gupta, Kenny, et al. 2015

EURASIP J. Adv. Signal Process.

Self-citation

The REVERB challenge provides a common framework for the evaluation of feature extraction techniques in the presence of both reverberation and additive background noise. State-of-the-art speech recognition systems perform well in controlled environments, but their performance degrades in realistic acoustical conditions, especially in real as well as simulated reverberant environments. In this contribution, we utilize multiple feature extractors including the conventional mel-filterbank, multi-taper spectrum estimation-based mel-filterbank, robust mel and compressive gammachirp filterbank, iterative deconvolution-based dereverberated mel-filterbank, and maximum likelihood inverse filtering-based dereverberated mel-frequency cepstral coefficient features for speech recognition with multi-condition training data. In order to improve speech recognition performance, we combine their results using ROVER (Recognizer Output Voting Error Reduction). For two- and eight-channel tasks, to get benefited from the multi-channel data, we also use ROVER, instead of the multi-microphone signal processing method, to reduce word error rate by selecting the best scoring word at each channel. As in a previous work, we also apply i-vector-based speaker adaptation which was found effective. In speech recognition task, speaker adaptation tries to reduce mismatch between the training and test speakers. Speech recognition experiments are conducted on the REVERB challenge 2014 corpora using the Kaldi recognizer. In our experiments, we use both utterance-based batch processing and full batch processing. In the single-channel task, full batch processing reduced word error rate (WER) from 10.0 to 9.3 % on SimData as compared to utterance-based batch processing. Using full batch processing, we obtained an average WER of 9.0 and 23.4 % on the SimData and RealData, respectively, for the two-channel task, whereas for the eight-channel task on the SimData and RealData, the average WERs found were 8.9 and 21.7 %, respectively.

“…The most commonly used are the Gaussian mixture models and the hidden Markov models [10,11,14,26,37,40]. Also widely used are the support vector machines [11,14,38,39,41], the artificial neural networks [10], the k-nearest neighbor algorithm [14,38], the decision trees [10,38], the genetic algorithms [2], the fuzzy logic [42] and boosting techniques [41,43]. Related architectures incorporate fusion frameworks among recognition models [28,44] and combination of model-based and distance based algorithms.…”

Section: Introduction (mentioning)

confidence: 99%

“…[41,43] Related architectures incorporate fusion frameworks among recognition models [28,44] and combination of model-based and distance based algorithms [13,26,27,39,40]. Postprocessing schemes can improve the overall recognition accuracy. Among the postprocessing schemes are (i) transformation of the feature matrix [23,44-46], (ii) correction of logical errors based on empirical rules [11], (iii) isolation of the segments of interest in cases where the post-processing is focused on specific classes [10,11,13,38,40,47] and (iv) merging of sound events and separation of them in a post-processing stage.…”

Section: Introduction (mentioning)

confidence: 99%

Data-Driven Audio Feature Space Clustering for Automatic Sound Recognition in Radio Broadcast News

Theodorou, Mporas, Lazaridis, et al. 2017

Int. J. Artif. Intell. Tools

Aiming to an automatic sound recognizer for radio broadcasting events, a methodology of clustering the audio feature space using the discrimination ability of the audio descriptors as a criterion, is investigated in this work. From a given and close set of audio events, commonly found in broadcast news transmissions, a large set of audio descriptors is extracted and their data-driven ranking of relevance is clustered, providing a more robust feature selection. The clusters of the feature space are feeding machine learning algorithms implemented as classification models during the experimental evaluation. This methodology showed that support vector machines provide significantly good results, considering the achieved accuracy due to their ability of coping well in high dimensionality experimental conditions.
