Using Data-driven and Phonetic Units for Speaker Verification

Hannani, Asmaa El; Toledano, Doroteo T.; Petrovska‐Delacrétaz, Dijana; Montero-Asenjo, Alberto; Hennebert, Jean

doi:10.1109/odyssey.2006.248134

Cited by 7 publications

(5 citation statements)

References 9 publications

“…ally to spectrally stable portions of the signal. We then compute the gravity center for each segment and train a gender dependent vector quantizer to cluster these centers of gravity. The codebook size (64 in our case) defines the number of ALISP symbols. ALISP Language Model System: In this system [9], the label sequences produced by the ALISP recognizer are used to train ALISP n-gram models using the HTK Language Model (LM) tools (see 14th chap…”

Section: Complementary Information One Set Of Information Reflects

mentioning

confidence: 99%
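
The quoted statement describes a pipeline in which segment gravity centers are vector-quantized into a symbol inventory and the resulting label sequences feed n-gram models. A minimal sketch of that idea follows, under several assumptions: plain k-means stands in for the quantizer training (the statement only says "vector quantizer"), the features are random toy data, the codebook has 8 entries rather than the 64 reported, and simple bigram counting stands in for the HTK LM tools. All function names are hypothetical.

```python
from collections import Counter

import numpy as np


def gravity_centers(segments):
    """Return the mean feature vector (center of gravity) of each segment."""
    return np.stack([seg.mean(axis=0) for seg in segments])


def quantize(vectors, codebook):
    """Return the index of the nearest codeword (Euclidean) for each vector."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)


def train_codebook(centers, size, iters=20, seed=0):
    """Train a codebook with plain k-means (an assumed stand-in method)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(centers), size=size, replace=False)
    codebook = centers[start].copy()
    for _ in range(iters):
        labels = quantize(centers, codebook)
        for k in range(size):
            members = centers[labels == k]
            if len(members):  # leave a codeword unchanged if its cell is empty
                codebook[k] = members.mean(axis=0)
    return codebook


def bigram_counts(labels):
    """Count adjacent symbol pairs in a label sequence."""
    return Counter(zip(labels[:-1], labels[1:]))


# Toy data: 200 segments of 3 to 9 frames, 12-dimensional features (hypothetical).
rng = np.random.default_rng(1)
segments = [rng.normal(size=(int(rng.integers(3, 10)), 12)) for _ in range(200)]

centers = gravity_centers(segments)
codebook = train_codebook(centers, size=8)   # the quoted system reports 64
symbols = quantize(centers, codebook).tolist()
bigrams = bigram_counts(symbols)
```

The codebook size fixes the number of distinct symbols, which is the point the quoted passage makes about the 64-entry codebook.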

“…This way the availability of corpora is much less an issue and the training corpus can be chosen to match the working conditions as much as possible. This paper is the continuation of previous attempts to model high-level information using data-driven approaches [9,8]. The focus here is on the fusion of different systems that exploit data-driven high-level source of in-…”

mentioning

confidence: 98%

“…to model high-level information using data-driven approaches [9,8]. The focus here is on the fusion of different systems that exploit data-driven high-level source of in-…”

mentioning

confidence: 99%

Data-Driven High-Level Information for Text-Independent Speaker Verification

Hannani, Petrovska‐Delacrétaz

2007

2007 IEEE Workshop on Automatic Identification Advanced Technologies

Self Cite

Recently, various studies have shown that high-level features, such as linguistic content, pronunciation and idiolectal word usage, convey more speaker information and can be added to the low-level features in order to increase the robustness of the system. Usually these features are extracted by analyzing streams produced by phonetic speech recognition systems. Two of the major problems that arise when phone based systems are being developed are the possible mismatches between the development and evaluation data and the lack of transcribed databases. We propose in this paper to replace the phone-based approaches by data-driven segmentation methodologies. Our data-driven high-level systems do not use transcribed data and can easily be applied on development data minimizing the mismatches. These systems were fused with a state-of-the-art acoustic Gaussian Mixture Models (GMM) system. Results obtained on the NIST 2006 Speaker Recognition Evaluation data show that the data-driven features provide complementary information and the resulting fused system reduced the error rate in comparison to the GMM baseline system.

…only the acoustic content of speech to trying to exploit high-level information. It has been reported in several studies that gains in speaker recognition accuracy are possible by exploiting such high-level information sources (see e.g. [17]). The most examined high-level information for speaker verification are: the prosody [19, 1], the phonetic information [15, 14, 13, 11], and the idiolectal word and phone usage [7, 2, 5, 12]. All these approaches reported encouraging results and were found to provide features complementary to short-term acoustic features. However most of them are based on phonetic transcriptions that are error-prone and expensive to create. Beside this, the transcribed databases need also to be updated with new data sets in order to match with potentially new specifications (channel, microphones, context of use, ...) of the verification data. An alternative approach that solves these two problems is using data-driven phone-like units derived directly from untranscribed speech. This way the availability of corpora is much less an issue and the training corpus can be chosen to match the working conditions as much as possible. This paper is the continuation of previous attempts to model high-level information using data-driven approaches [9, 8]. The focus here is on the fusion of different systems that exploit data-driven high-level source of in-

“…ASR [36] Data-driven temporal filters are designed using PCA, LDA and minimum classification error (MCE) framework. ASR [37] Speech segments are created using a data-driven and automatic language independent speech processing (ALISP). ASV [38] This work uses F-ratio to adjust the center and edge frequencies of the filterbank and the F-ratio is computed for speaker separability.…”

Section: Introduction

mentioning

confidence: 99%

Optimization of data-driven filterbank for automatic speaker verification

Sarangi

Sahidullah

Saha

2020

Digital Signal Processing

Most speech processing applications use triangular filters spaced on the mel scale for feature extraction. In this paper, we propose a new data-driven filter design method which optimizes filter parameters from given speech data. First, we introduce a frame-selection based approach for developing a speech-signal-based frequency warping scale. Then, we propose a new method for computing the filter frequency responses by using principal component analysis (PCA). The main advantage of the proposed method over the recently introduced deep learning based methods is that it requires a very limited amount of unlabeled speech data. We demonstrate that the proposed filterbank has more speaker discriminative power than the commonly used mel filterbank as well as the existing data-driven filterbank. We conduct automatic speaker verification (ASV) experiments with different corpora using various classifier back-ends. We show that the acoustic features created with the proposed filterbank are better than existing mel-frequency cepstral coefficients (MFCCs) and speech-signal-based frequency cepstral coefficients (SFCCs) in most cases. In the experiments with VoxCeleb1 and the popular i-vector back-end, we observe a 9.75% relative improvement in equal error rate (EER) over MFCCs. Similarly, the relative improvement is 4.43% with the recently introduced x-vector system. We obtain further improvement using fusion of the proposed method with the standard MFCC-based approach.
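
The abstract reports its gains as relative improvements in EER. As a reminder of how such a figure is computed, here is a small sketch. The two EER values are hypothetical and chosen only so that the result lands near the 9.75% figure; the abstract does not state the underlying absolute EERs, and the function name is an assumption.

```python
def relative_improvement(baseline_eer, new_eer):
    """Relative reduction in EER, expressed as a percentage of the baseline."""
    if baseline_eer <= 0:
        raise ValueError("baseline EER must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical EERs in percent, for illustration only (not from the paper).
baseline_eer = 4.00   # assumed MFCC baseline
proposed_eer = 3.61   # assumed data-driven filterbank result

gain = relative_improvement(baseline_eer, proposed_eer)
print(f"relative EER improvement: {gain:.2f}%")
```

A relative figure of this kind depends on the baseline: the same absolute drop in EER yields a larger relative improvement when the baseline error is already low.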

“…The units examined in the past include word N-grams, syllables, phones, and Automatic Language Independent Speech Processing (ALISP) units [4] (which are designed to mimic the phones) and MLP-based phonetic units [5]. Many of the units, such as the words and phones, are used only because their transcripts are readily available via Automatic Speech Recognition, and are incorporated without regard to their actual speaker discriminative abilities.…”

Section: Introduction

mentioning

confidence: 99%

Towards Structured Approaches to Arbitrary Data Selection and Performance Prediction for Speaker Recognition

Lei

2009

Advances in Biometrics

Abstract. We developed measures relating feature vector distributions to speaker recognition (SR) performance, for performance prediction and potential arbitrary data selection for SR. We examined the measures of mutual information, kurtosis, correlation, and measures pertaining to intra- and inter-speaker variability. We applied the measures on feature vectors of phones to determine which measures gave good SR performance prediction of phones standalone and in combination. We found that mutual information had a -83.5% correlation with the Equal Error Rates (EERs) of each phone. Also, Pearson's correlation between the feature vectors of two phones had a -48.6% correlation with the relative EER improvement of the score-level combination of the phones. When implemented in our new data-selection scheme (which does not require an SR system to be run), the measures allowed us to select data with 2.13% overall EER improvement (on SRE08) over data selected via a brute-force approach, at a fifth of the computational costs.
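
The abstract relates a per-phone measure to per-phone EERs through a correlation coefficient. The sketch below shows how a Pearson correlation of that kind can be computed. The per-phone values are invented for illustration, the abstract does not say exactly how its correlations were obtained, and the function name is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-phone values, for illustration only (not from the paper).
mutual_info = [0.9, 0.7, 0.6, 0.4, 0.2]       # assumed measure per phone
phone_eers = [12.0, 15.0, 17.5, 21.0, 26.0]   # assumed EER (%) per phone

r = pearson(mutual_info, phone_eers)
print(f"correlation between measure and EER: {100 * r:.1f}%")
```

A strongly negative value means that phones with a higher measure tend to have lower EERs, which is the direction of the relationship the abstract reports for mutual information.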
