We tested the influence of fundamental oscillation (fo) on human and machine speaker-recognition performance in vocalic test utterances. In experiment I, we trained a Gaussian mixture model on 15 speakers (80 multi-word utterances each) and tested it on sustained vowel utterances (/a:/, /i:/, and /u:/) under six fo conditions: three changing (fall, rise, fall-rise) and three steady-state (high, mid, low). Performance was better for the steady-state than for the changing conditions, and within the steady-state conditions it was poorest for high fo. In experiment II, we tested 9 human listeners on a subset of 4 speakers from experiment I. Listeners completed two training tasks (training 1: multi-word utterances; training 2: words) and then recognized the speakers from the same vocalic utterances used in experiment I (for these 4 speakers). Performance was about equally high for the changing and steady-state vowels; within the steady-state condition, however, it was best for high-fo vowels. The experiments suggest that (a) fo influences the strength of speaker-specific characteristics in vowels and (b) humans, compared to machines, attend to different acoustic information in vocalic utterances for speaker recognition.
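The closed-set identification scheme described above (one Gaussian mixture model per speaker, scored on test utterances) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic per-frame feature vectors, the speaker count, and the mixture size are all assumptions of the sketch, standing in for real MFCC front-end output.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for per-frame acoustic features: each "speaker" is
# given a distinct mean vector in a 13-dimensional feature space.
n_speakers, n_dim = 4, 13
speaker_means = rng.normal(0.0, 3.0, size=(n_speakers, n_dim))

def frames(speaker, n=200):
    """Simulated feature frames for one speaker (real MFCCs would come
    from a filterbank + DCT front end, not used here)."""
    return speaker_means[speaker] + rng.normal(0.0, 1.0, size=(n, n_dim))

# Train one GMM per speaker on "training utterances".
models = [GaussianMixture(n_components=2, random_state=0).fit(frames(s))
          for s in range(n_speakers)]

def identify(test_frames):
    """Closed-set identification: choose the speaker model with the
    highest average frame log-likelihood."""
    scores = [m.score(test_frames) for m in models]
    return int(np.argmax(scores))

# A short "test utterance" from speaker 2; the sketch recovers the label.
predicted = identify(frames(2, n=50))
print(predicted)
```

In a closed-set setup like this, identification reduces to a maximum-likelihood decision over the per-speaker models; varying the fo conditions of the test material, as in experiment I, changes how well the test frames match the training distributions.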
Human voices are individual, and humans have elaborate skills in recognizing speakers by their voice, phenomena that are deeply rooted in the evolution of human behavior. To date, the mechanisms of speaker recognition are not well understood because of the high variability of the acoustic cues to a speaker's identity. We asked what role the speaker plays in making his/her voice more or less recognizable. While it is evident from the literature that humans can control vocal properties to enhance their intelligibility, it is unclear whether speakers can and/or do control vocal characteristics to be better recognizable and whether such control mechanisms play a role in the communication process. In this paper, we review results from the literature supporting the view that speaker-idiosyncratic information is dynamic and that humans have the ability to control how well they can be recognized. We suggest possible experimental setups by which control over identity in the voice can be tested, and we present pilot acoustic characteristics of speech produced to be either (a) intelligible (clear speech) or (b) suitable for person recognition (identity-marked speech). Results give reason to believe that speakers apply different mechanisms when making their individuality identifiable as opposed to making their speech better understood. We discuss the predictions that control of recognizability and intelligibility generates within major theories of speech perception.
Adult speakers commonly alter their voices when talking to infants, giving rise to an infant-directed speech (IDS) style. Here we tested the effects of infant-directed speech on the recognizability of a speaker’s voice. Ten Swiss-German mothers were recorded talking to their infants (IDS) and talking to an adult experimenter (in adult-directed speech, ADS). We studied the indexical properties using Mel-frequency cepstral coefficients (MFCCs). The segmental 13-dimensional MFCCs were clustered with an unsupervised K-means algorithm and reduced to two dimensions using Principal Component Analysis. Results showed that the relative area occupied by IDS in the 13-dimensional space was significantly larger than the area occupied by ADS. A supervised, language-independent Gaussian Mixture Model revealed that this expansion benefitted the recognizability of mothers’ voices. This means that the higher indexical variability in IDS fosters recognition of individual mothers. Results are consistent with the view that IDS may have evolved in part as a strategy to promote indexical signaling by a mother to her offspring, thereby promoting mother-infant attachment and fostering offspring survival.
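The clustering-and-projection pipeline described above can be sketched as below. This is a simplified illustration, not the study's analysis: the two synthetic feature clouds (IDS simulated with a wider spread than ADS), the number of clusters, and the bounding-box area used as a proxy for the occupied region are all assumptions of the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic 13-dimensional stand-ins for segmental MFCC frames; IDS is
# simulated with twice the spread of ADS (an assumption of this sketch).
ads = rng.normal(0.0, 1.0, size=(500, 13))
ids_frames = rng.normal(0.0, 2.0, size=(500, 13))

def centroids(frames, k=8):
    """Summarise one condition by k-means centroids in the 13-dim space."""
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(frames)
    return km.cluster_centers_

c_ads, c_ids = centroids(ads), centroids(ids_frames)

# Joint PCA to two dimensions over both sets of centroids.
pca = PCA(n_components=2).fit(np.vstack([c_ads, c_ids]))

def bbox_area(c):
    """Bounding-box area of the 2-D projection: a crude proxy for the
    relative area a condition occupies in the reduced space."""
    p = pca.transform(c)
    return float(np.prod(p.max(axis=0) - p.min(axis=0)))

ids_larger = bbox_area(c_ids) > bbox_area(c_ads)
print(ids_larger)
```

Because the simulated IDS frames span a wider region of feature space, their projected centroids cover a larger area, mirroring the expansion the study reports for real IDS.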
Existing databases of isolated vowel sounds, or of vowel sounds embedded in consonantal context, generally document only limited variation of basic production parameters. Concerning the possible variation range of vowel- and voice-quality-related sound characteristics, there is thus a lack of broad phenomenological and descriptive references that allow for a comprehensive understanding of vowel acoustics and for an evaluation of the extent to which corresponding existing approaches and models can be generalised. To contribute to the building of such references, a novel database of vowel sounds that exceeds any existing collection in size and diversity of vocalic characteristics is presented here, comprising c. 34600 utterances of 70 speakers (46 nonprofessional speakers, children, women and men, and 24 professional actors/actresses and singers of straight theatre, contemporary singing, and European classical singing). The database focuses on sounds of the long Standard German vowels /i-y-e-ø-a-o-u/ produced with varying basic production parameters such as phonation type, vocal effort, fundamental frequency, vowel context, and speaking or singing style. In addition, a read text and, for professionals, songs are also included. The database is accessible for scientific use, and further extensions are in progress.
An unsupervised automatic clustering algorithm (k-means) classified 1282 Mel-frequency cepstral coefficient (MFCC) representations of isolated steady-state vowel utterances from eight Standard German vowel categories with fo between 196 and 698 Hz. Experiment I determined the number of MFCCs (1-20) and the spectral bandwidth (2-20 kHz) at which performance peaked (five MFCCs at 4 kHz). In experiment II, classification performance across different fo ranges revealed that ranges with fo > 500 Hz reduced classification performance, although it remained well above chance. This shows that isolated steady-state vowels with strongly undersampled spectra contain sufficient acoustic information to be classified automatically.
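Scoring an unsupervised clustering as a vowel classifier, as described above, can be sketched as follows. This is an illustrative toy, not the study's setup: the synthetic five-dimensional "MFCC" vectors, the category separations, and the majority-vote cluster-to-category mapping are assumptions of the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Synthetic stand-ins for MFCC vectors of eight vowel categories
# (five coefficients, echoing the reported optimum, but simulated data).
n_cat, n_per, n_dim = 8, 40, 5
means = rng.normal(0.0, 4.0, size=(n_cat, n_dim))
X = np.vstack([m + rng.normal(0.0, 1.0, size=(n_per, n_dim)) for m in means])
y = np.repeat(np.arange(n_cat), n_per)

# Unsupervised k-means with one cluster per vowel category.
labels = KMeans(n_clusters=n_cat, n_init=10, random_state=2).fit_predict(X)

# Score the clustering as classification: map each cluster to its
# majority vowel category, then count correctly grouped items.
correct = 0
for c in range(n_cat):
    members = y[labels == c]
    if members.size:
        correct += np.bincount(members).max()
accuracy = correct / y.size
print(accuracy > 1 / n_cat)  # chance level for eight categories is 12.5%
```

Under fo variation, real vowel spectra become sparsely sampled and category clouds overlap more, which in this framing would lower the majority-vote accuracy while leaving it above the 12.5% chance level, as the study reports.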