Quality measures for speaker verification with short utterances

Poddar, Arnab; Sahidullah, Md; Saha, Goutam

doi:10.1016/j.dsp.2019.01.023

Cited by 15 publications

(3 citation statements)

References 54 publications


“…The performances of the automatic speaker verification systems degrade, due to the reduction in the amount of speech used for enrolment and verification. Combining multiple systems (based on different features and classifiers) can considerably reduce the speaker verification error-rate with short utterances [43].…”

Section: Proposed Methods (mentioning)

confidence: 99%

Forensic Speaker Verification Using Ordinary Least Squares

Machado, Filho, Oliveira. 2019. Sensors.


In Brazil, the recognition of speakers for forensic purposes still relies on a subjectivity-based decision-making process through a results analysis of untrustworthy techniques. Owing to the lack of a voice database, speaker verification is currently applied to samples specifically collected for confrontation. However, speaker comparative analysis via contested discourse requires the collection of an excessive amount of voice samples for a series of individuals. Further, the recognition system must inform who is the most compatible with the contested voice from pre-selected individuals. Accordingly, this paper proposes using a combination of linear predictive coding (LPC) and ordinary least squares (OLS) as a speaker verification tool for forensic analysis. The proposed recognition technique establishes confidence and similarity upon which to base forensic reports, indicating verification of the speaker of the contested discourse. Therefore, in this paper, an accurate, quick, alternative method to help verify the speaker is contributed. After running seven different tests, this study preliminarily achieved a hit rate of 100% considering a limited dataset (Brazilian Portuguese). Furthermore, the developed method extracts a larger number of formants, which are indispensable for statistical comparisons via OLS. The proposed framework is robust at certain levels of noise, for sentences with the suppression of word changes, and with different quality or even meaningful audio time differences.



“…• There is a challenge in achieving high performance in speaker recognition systems based on short segment speech because the shorter the speech segment, the greater is the intra-speaker variability [48, 49]. • Earlier works on multimodal speaker recognition systems have shown that performance improved either by using bone microphone speech or throat microphone speech in tandem with air microphone speech, as each of these alternate sensors capture complementary evidence.…”

Section: Short Speech Segments For Speaker Modeling (mentioning)

confidence: 99%

Recurrence plot embeddings as short segment nonlinear features for multimodal speaker identification using air, bone and throat microphones

Nawas, Shahina, Balachandar et al. 2024. Sci Rep.


Speech is produced by a nonlinear, dynamical Vocal Tract (VT) system, and is transmitted through multiple (air, bone and skin conduction) modes, as captured by the air, bone and throat microphones respectively. Speaker specific characteristics that capture this nonlinearity are rarely used as stand-alone features for speaker modeling, and at best have been used in tandem with well known linear spectral features to produce tangible results. This paper proposes Recurrent Plot (RP) embeddings as stand-alone, non-linear speaker-discriminating features. Two datasets, the continuous multimodal TIMIT speech corpus and the consonant-vowel unimodal syllable dataset, are used in this study for conducting closed-set speaker identification experiments. Experiments with unimodal speaker recognition systems show that RP embeddings capture the nonlinear dynamics of the VT system which are unique to every speaker, in all the modes of speech. The Air (A), Bone (B) and Throat (T) microphone systems, trained purely on RP embeddings perform with an accuracy of 95.81%, 98.18% and 99.74%, respectively. Experiments using the joint feature space of combined RP embeddings for bimodal (A–T, A–B, B–T) and trimodal (A–B–T) systems show that the best trimodal system (99.84% accuracy) performs on par with trimodal systems using spectrogram (99.45%) and MFCC (99.98%). The 98.84% performance of the B–T bimodal system shows the efficacy of a speaker recognition system based entirely on alternate (bone and throat) speech, in the absence of the standard (air) speech. The results underscore the significance of the RP embedding, as a nonlinear feature representation of the dynamical VT system that can act independently for speaker recognition. It is envisaged that speech recognition too will benefit from this nonlinear feature.


“…However, the available documentation on the SiiP project does not indicate the actual performance of the system with real data and what characteristics of an audio sample such as length or quality are enough to identify a person in a large OSINT database or phone recordings. According to research, a small audio sample of 30-60 s length can be enough to verify the identity of a person in benchmark datasets (Poddar et al, 2019) yet the robustness of the tools depends on factors such as noise, heterogeneous speakers, heterogeneous recording devices or audio encoding. 7 Practitioners recognised the quality of voice samples needed for speaker identification as one of the key challenges with the project, OSINT data generating better results than phone recordings.…”

Section: Features Of SiiP (mentioning)

confidence: 99%

Biometric identity systems in law enforcement and the politics of (voice) recognition: The case of SiiP

2021


Biometric identity systems are now a prominent feature of contemporary law enforcement, including in Europe. Often advanced on the premise of efficiency and accuracy, they have also been the subject of significant controversy. Much attention has focussed on longer-standing biometric data collection, such as finger-printing and facial recognition, foregrounding concerns with the impact such technologies can have on the nature of policing and fundamental human rights. Less researched is the growing use of voice recognition in law enforcement. This paper examines the case of the recent Speaker Identification Integrated Project, a European wide initiative to create the first international and interoperable database of voice biometrics, now the third largest biometric database at Interpol. Drawing on Freedom of Information requests, interviews and public documentation, we outline the emergence and features of SiiP and explore how voice is recognised and attributed meaning. We understand Speaker Identification Integrated Project as constituting a particular ‘regime of recognition’ premised on the use of soft biometrics (age, language, accent and gender) to disembed voice in order to optimise for difference. This, in turn, has implications for the nature and scope of law enforcement, people's position in society, and justice concerns more broadly.

