Variable frame rate-based data augmentation to handle speaking-style variability for automatic speaker verification

Afshan, Amber; Guo, Jianmin; Park, Soo‐Jin; Ravi, Vijay; McCree, Alan; Alwan, Abeer

doi:10.48550/arxiv.2008.03616

Cited by 2 publications

(5 citation statements)

References 25 publications

(35 reference statements)

“…Few studies have focused on whether these factors actually matter [34][35][36], although if they do, it may bias the evaluation of the evidence. In the present study, we use the three speech styles available in the dataset developed for forensic claims and compare the results depending on sample duration.…”

Section: Introduction (mentioning)

confidence: 99%

Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings

Schiferl

Fejes

2023

Journal of Forensic Sciences

In forensic voice comparison, deep learning has recently become widely popular. It is mainly used to learn speaker representations, called embeddings or embedding vectors. Speaker embeddings are often trained using corpora mostly containing widely spoken languages. Thus, language dependency is an important factor in automatic forensic voice comparison, especially when the target language is linguistically very different from the one the model was trained on. In the case of a low‐resource language, developing a corpus for forensic purposes containing enough speakers to train deep learning models is costly. This study aims to investigate whether a model pre‐trained on a multilingual (mostly English) corpus can be used on a target low‐resource language (here, Hungarian) that is not represented in the model. Often, multiple samples are not available from the offender (unknown speaker). Samples are therefore compared pairwise with and without speaker enrollment for suspect (known) speakers. Two corpora are used that were developed especially for forensic purposes and a third that is meant for traditional speaker verification. Speaker embedding vectors are extracted by the x‐vector and ECAPA‐TDNN techniques. Speaker verification was evaluated in the likelihood‐ratio framework. A comparison is made between the language combinations (modeling, LR calibration, and evaluation). The results were evaluated by Cllrmin and EER metrics. It was found that a model pre‐trained on a different language but on a corpus with a significant number of speakers can be used on samples with language mismatch. Sample duration and speaking style also seem to affect the performance.
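The abstract above names EER and Cllrmin as its evaluation metrics. As a rough illustration only (not the authors' evaluation code), the sketch below computes an approximate EER by sweeping thresholds over the observed scores, and the plain Cllr from natural-log likelihood ratios. The function names `eer` and `cllr` and the toy scores are assumptions made for this example. Cllrmin would additionally require an optimal monotonic recalibration of the scores (typically via the pool-adjacent-violators algorithm), which is omitted here.

```python
import math


def eer(target_scores, nontarget_scores):
    """Approximate equal error rate via a threshold sweep over observed scores."""
    best_gap, best_rate = float("inf"), None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # False acceptance: non-target trial scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits; inputs are natural-log LRs."""
    tar = sum(_softplus(-x) for x in target_llrs) / len(target_llrs)
    non = sum(_softplus(x) for x in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (tar + non) / math.log(2)


# Toy scores, invented for illustration only.
toy_targets = [2.0, 1.5, 0.5, 3.0]
toy_nontargets = [-1.0, 0.0, 1.0, -2.0]
toy_eer = eer(toy_targets, toy_nontargets)
toy_cllr = cllr(toy_targets, toy_nontargets)
```

A system that always outputs a log-LR of 0 (no information) has a Cllr of exactly 1 bit, which is the usual reference point: values below 1 indicate that the likelihood ratios carry useful information.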

“…The multi-condition training (MCT) [42] can be regarded as a special normalization approach, belonging to the scoring theme. It pools the data from both enrollment and test conditions and trains a multi-conditional PLDA.…”

Section: Related Work (mentioning)

confidence: 99%

“…Our approach (SD/LT as the first and simplest case) also belongs to the scoring theme, but it is fundamentally different from the normalization methods that pursue a condition-insensitive model as in IDVC [35] or MCT [42]. Instead, it admits the discrepancy between the enrollment and test conditions, and models the statistics of speaker vectors in the two conditions respectively.…”

Section: Related Work (mentioning)

confidence: 99%

“…Therefore, any solution that does not address this incoherence is theoretically not optimal. For example, the commonly adopted multi-condition training (MCT) approach that pools data from both enrollment and test condition for PLDA training [42]. By this approach, the resultant PLDA is just an empirical compromise between the enrollment condition and the test condition, but is not optimal for either of them.…”

Section: Introduction (mentioning)

confidence: 99%

“…It is often significantly different from the enrollment condition, and varies from one test to another. Significant performance reduction is often observed with this mismatch [39], [40], [41], [42]. Some typical scenarios that involve serious enrollment-test mismatch are:…”

Section: Introduction (mentioning)

confidence: 99%

A Principle Solution for Enroll-Test Mismatch in Speaker Recognition

Li

Wang

Kang et al.

2020

Preprint

Mismatch between enrollment and test conditions causes serious performance degradation in speaker recognition systems. This paper presents a statistics decomposition (SD) approach to solve this problem. This approach is based on the normalized likelihood (NL) scoring framework, and is theoretically optimal if the statistics on both the enrollment and test conditions are accurate. A comprehensive experimental study was conducted on three datasets with different types of mismatch: (1) physical channel mismatch, (2) speaking style mismatch, (3) near-far recording mismatch. The results demonstrated that the proposed SD approach is highly effective, and outperforms the ad hoc multi-condition training approach that is commonly adopted but not optimal in theory.
