2018
DOI: 10.1121/1.5045323

Towards understanding speaker discrimination abilities in humans and machines for text-independent short utterances of different speech styles

Abstract: Little is known about human and machine speaker discrimination ability when utterances are very short and the speaking style is variable. This study compares text-independent speaker discrimination ability of humans and machines based on utterances shorter than 2 s in two different speaking styles (read sentences and speech directed towards pets, characterized by exaggerated prosody). Recordings of 50 female speakers drawn from the UCLA Speaker Variability Database were used as stimuli. Performance of 65 human…

Cited by 13 publications (10 citation statements)
References 47 publications
“…As hypothesized, the worst performance for both humans and machines was obtained for style-mismatched read speech - conversation trials. Results were consistent with the hypothesis of our previous study [9] of read and pet-directed speech from the same set of speakers. That study showed that humans consistently performed better than machines in both read speech - read speech (EER = 19.02% versus 30.31%) and read speech - pet-directed speech (EER = 39.23% versus 44.17%) trials.…”
Section: Human and Machine Performance (supporting)
confidence: 92%
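The EER figures quoted above are equal error rates: the operating point at which the miss rate on same-speaker trials equals the false-alarm rate on different-speaker trials. As a minimal illustrative sketch (not the study's evaluation code), assuming per-trial similarity scores are available as two NumPy arrays, an approximate EER can be computed like this:

import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER from same-speaker (target) and different-speaker
    (non-target) trial scores, using the accept rule score >= threshold."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss rate: fraction of target scores falling below each candidate threshold.
    miss = np.searchsorted(np.sort(target_scores), thresholds) / len(target_scores)
    # False-alarm rate: fraction of non-target scores at or above each threshold.
    fa = 1.0 - np.searchsorted(np.sort(nontarget_scores), thresholds) / len(nontarget_scores)
    i = int(np.argmin(np.abs(miss - fa)))
    return 0.5 * (miss[i] + fa[i])

# Toy scores (synthetic, illustrative only): same-speaker trials score higher on average.
rng = np.random.default_rng(0)
tar = rng.normal(1.0, 1.0, 1000)
non = rng.normal(-1.0, 1.0, 1000)
print(f"EER ~ {100 * equal_error_rate(tar, non):.2f}%")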
“…For example, style variability confuses earwitnesses hearing a criminal shouting vs reading aloud during a voice lineup [8]. Human and machine speaker discrimination performance has been compared when style changed from read to pet-directed speech, which is characterized by exaggerated prosody [9]. In both examples, the differences in style were extreme, and little is known about how moderate variations in style, for example between read and conversational speech, affect the relative performance of humans vs machines in speaker discrimination.…”
Section: Introduction (mentioning)
confidence: 99%
“…Additionally, the impact of vocal effort ranging from whisper (Vestman et al., 2018) to shout (Hanilci et al., 2013) and scream (similar to shout but lacking phonemic structure) (Hansen et al., 2017) has been addressed in many studies. Other examples include acted speech by naive or professional speakers (Pietrowicz et al., 2017), pet-directed speech (Park et al., 2018), and the impact of varied speech …”
Section: Related Work (mentioning)
confidence: 99%
“…Style factors are shown to be present in widely used speaker representations [13] such as i-vectors [14] and x-vectors [4]. ASV performance degradation due to style mismatch between the enrollment and test utterances was systematically analyzed in [15,16,17]. To alleviate the degradation due to style variabilities, some studies proposed the use of a joint factor analysis framework [11,12].…”
Section: Introduction (mentioning)
confidence: 99%
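Fixed-dimensional speaker embeddings such as i-vectors and x-vectors are commonly compared with a simple similarity score (e.g., cosine scoring, or PLDA in full systems). The sketch below is a hypothetical illustration with synthetic 512-dimensional vectors standing in for embeddings; it is not tied to any particular extractor or toolkit.

import numpy as np

def cosine_score(enroll_embedding, test_embedding):
    """Cosine similarity between two speaker embeddings; higher scores
    suggest the two utterances come from the same speaker."""
    e = enroll_embedding / np.linalg.norm(enroll_embedding)
    t = test_embedding / np.linalg.norm(test_embedding)
    return float(np.dot(e, t))

# Synthetic embeddings (illustrative only; real x-vectors come from a trained network).
rng = np.random.default_rng(1)
enroll = rng.normal(size=512)
same = enroll + 0.3 * rng.normal(size=512)  # same speaker, small within-speaker shift
diff = rng.normal(size=512)                 # a different, unrelated speaker
print(cosine_score(enroll, same), cosine_score(enroll, diff))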