P-Value Segment Selection Technique for Speaker Verification

Nosratighods, Mohaddeseh; Ambikairajah, Eliathamby; Epps, Julien; Carey, Michael J.

doi:10.1109/icassp.2007.366901

Cited by 3 publications (3 citation statements)

References 11 publications (16 reference statements)


“…However such systems, which are invariably founded on the GMM/UBM paradigm [2], exhibit high sensitivity to the quantity of data, particularly the reference model data [3], [4], [5]. Their performance degrades strongly while reducing the duration of speech material available [6,7,8]. For situations where the speech duration is below 30 seconds, recognition performance falls rapidly [9,10].…”

Section: Introduction (mentioning)

confidence: 99%

Constrained temporal structure for text-dependent speaker verification

Larcher, Bonastre, Mason

2013

Digital Signal Processing


In the context of mobile devices, speaker recognition engines may suffer from ergonomic constraints and limited amount of computing resources. Even if they prove their efficiency in classical contexts, GMM/UBM systems show their limitations when restricting the quantity of speech data. In contrast, the proposed GMM/UBM extension addresses situations characterised by limited enrolment data and only the computing power typically found on modern mobile devices. A key contribution comes from the harnessing of the temporal structure of speech using client-customised pass-phrases and new Markov model structures. Additional temporal information is then used to enhance discrimination with Viterbi decoding, increasing the gap between client and imposter scores. Experiments on the MyIdea database are presented with a standard GMM/UBM configuration acting as a benchmark. When imposters do not know the client pass-phrase, a relative gain of up to 65% in terms of EER is achieved over the GMM/UBM baseline configuration. The results clearly highlight the potential of this new approach, with a good balance between complexity and recognition accuracy.


“…Thus phonetic content variation, in addition to other factors such as the variability of the feature vector distribution from session to session, and MAP adaptation itself, has made some regions of the feature space less reliable in making the final decision. We have addressed the score variability caused by the lack of training data in our previous work [6,7] by dropping the nondiscriminative frames according to their target and impostor scores without making any a priori assumptions about the distributions of impostor and target scores.…”

Section: Introduction (mentioning)

confidence: 99%

“…Following on from our previous investigations [6,7], we now address the score variability caused by phonetic variation by emphasising the best scoring GMM frames that are strongly correlated with particular phonemes e.g. vowels and nasals [3].…”

Section: Introduction (mentioning)

confidence: 99%

Score weighting in speaker verification systems

Nosratighods, Ambikairajah, Epps, et al.

2007

2007 6th International Conference on Information, Communications & Signal Processing

Self Cite


This paper presents a method for re-weighting the frame-based scores of a speaker recognition system according to the discrimination level of the best matched Gaussian mixture for that frame. This approach focuses on particular feature space regions that either have been modeled accurately or contain the phonemes which are inherently most discriminative. The performance of individual Gaussian mixtures in terms of Equal Error Rate (EER) and minimum Detection Cost Function (DCF) on training, development and testing datasets consistently suggests that some Gaussian mixtures are inherently more discriminative regardless of their occurrence in training data. Therefore, it is possible to enhance the performance of speaker verification systems by re-weighting the frames that are mainly produced by those discriminative Gaussian mixtures. Compared with the baseline, results show a relative improvement of 5.82% and 5.46% on male speakers from the NIST 2002 dataset, in terms of EER and min DCF, respectively.
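The re-weighting idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how the per-mixture weights are derived or how the weighted scores are normalised, so the sketch assumes a simple weighted average of per-frame log-likelihood ratios. The function name, argument names, and toy values are hypothetical and are not taken from the paper.

```python
import numpy as np

def weighted_frame_score(frame_llr, best_mixture, mixture_weight):
    """Combine per-frame scores, weighting each frame by the weight of its
    best-matched Gaussian mixture (assumed form: normalised weighted mean)."""
    frame_llr = np.asarray(frame_llr, dtype=float)
    # Look up, for every frame, the weight of the mixture that matched it best.
    w = np.asarray(mixture_weight, dtype=float)[np.asarray(best_mixture)]
    return float(np.sum(w * frame_llr) / np.sum(w))

# Toy data (hypothetical): 4 frames scored against a 3-mixture model.
frame_llr = [0.5, -0.2, 1.0, 0.1]     # per-frame log-likelihood ratios
best_mixture = [0, 2, 0, 1]           # index of the best-matched mixture per frame
mixture_weight = [2.0, 1.0, 0.5]      # assumed discrimination weight per mixture

score = weighted_frame_score(frame_llr, best_mixture, mixture_weight)
baseline = weighted_frame_score(frame_llr, best_mixture, [1.0, 1.0, 1.0])
```

With uniform weights the function reduces to the plain frame average, so `baseline` plays the role of an unweighted score in this toy setting, while `score` gives more influence to frames matched by mixture 0.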
