Beyond speech: Exploring diversity in the human voice

Anikin, Andrey; Canessa-Pollard, Valentina; Pisanski, Katarzyna; Massenet, Mathilde; Reby, David

doi:10.1016/j.isci.2023.108204

Cited by 8 publications

(9 citation statements)

References 50 publications

“…Speech and song are produced by the same vocal tract, yet each makes distinct demands on musculature, breathing, and motor control mechanisms [13,14], raising the possibility that certain acoustical cues could serve as markers of each category [15]. However, even though people readily distinguish speech and song, the cues underlying the categories, even within cultures, are far from clear [11,16–19], so that such a claim is difficult to address. Indeed, even if speech and singing reliably exist as separate, recognizable entities, their cognitive representation could depend mostly on learned regularities that are particular to each cultural group.…”

Section: Introduction (mentioning)

confidence: 99%

Spectro-temporal acoustical markers differentiate speech from song across cultures

Albouy, Mehr, Hoyer, et al. 2023. Preprint

Humans produce two primary forms of vocal communication: speaking and singing. What is the basis for these two categories? Is the distinction between them based primarily on culturally specific, learned features, or do consistent acoustical cues exist that reliably distinguish speech and song worldwide? Some studies have suggested that important aspects of music can be distinguished from speech based on spectro-temporal modulation patterns, but this conclusion is based on Western music, leaving open the question of whether such a principle may apply more globally. Here, we studied the spectro-temporal modulation patterns of vocalizations produced by 369 people living in 21 urban, rural, and small-scale societies distributed across six continents. We show that specific ranges of spectral and temporal modulations differentiate speech from song in a consistent fashion, and that those ranges overlap within categories and across societies. Machine-learning analyses confirmed that this effect was cross-culturally robust, with vocalizations reliably classified solely from their spectro-temporal modulation patterns across all 21 societies. Listeners unfamiliar with most of the cultures could also classify the vocalizations, with similar accuracy patterns as the machine learning algorithm, indicating that the spectro-temporal cues used by the classifier are similar to those used by human listeners. Thus, the two most basic forms of human vocalization appear to exploit opposite extremes of the spectro-temporal continuum in a consistent fashion across societies. The findings support the idea that the human nervous system is specialized to produce and perceive two distinct ranges of spectro-temporal modulation in the service of the two distinct modes of human vocal communication.

“…Next, Anikin et al. (82) curated a different global recording dataset, including not only song and speech but also various nonverbal vocalizations (e.g., laughs, cries, and screams). Their analyses using spectrotemporal modulations also confirmed lower pitch in speech and steadier notes in singing.…”

Section: Discussion (mentioning)

confidence: 99%

Globally, songs and instrumental melodies are slower and higher and use more stable pitches than speech: A Registered Report

Ozaki, Tierney, Pfordresher, et al. 2024. Sci. Adv.

Both music and language are found in all known human societies, yet no studies have compared similarities and differences between song, speech, and instrumental music on a global scale. In this Registered Report, we analyzed two global datasets: (i) 300 annotated audio recordings representing matched sets of traditional songs, recited lyrics, conversational speech, and instrumental melodies from our 75 coauthors speaking 55 languages; and (ii) 418 previously published adult-directed song and speech recordings from 209 individuals speaking 16 languages. Of our six preregistered predictions, five were strongly supported: Relative to speech, songs use (i) higher pitch, (ii) slower temporal rate, and (iii) more stable pitches, while both songs and speech used similar (iv) pitch interval size and (v) timbral brightness. Exploratory analyses suggest that features vary along a “musi-linguistic” continuum when including instrumental melodies and recited lyrics. Our study provides strong empirical evidence of cross-cultural regularities in music and speech.

“…In the absence of suitable datasets for doing so, we tested at least the inter-rater reliability with which several trained raters performed manual annotation of NLP episodes. Specifically, we asked the attendants of the NLP workshop in St. Etienne in June 2023 to note all NLP episodes in a randomly selected subset of 23 vocalizations from a published corpus, all of which were reported as containing some NLP in the original publication [21]. The recordings included 10 human nonverbal vocalizations (5F + 5M), 10 speech samples (5F + 5M), and three samples of a cappella singing (2F + 1M); the duration varied from 2 to 10 s. Ten raters independently annotated four NLP types (frequency jumps, sidebands, subharmonics, and chaos).…”

Section: NLP Annotation and Quantification (mentioning)

confidence: 99%

“…As an exemplary check of NLP specificity, we calculated a variety of acoustic features (generic, NLP-specific, and derived from nonlinear time series analysis), frame by frame, in 5000 fully synthetic vocalizations (with ground truth of NLP presence and type known a priori), as well as in 1518 audio recordings of human nonverbal vocalizations, singing, and speech from [21] with a total duration of two hours and nearly 300,000 overlapping STFT frames 50 ms each (with NLP annotated manually). We then compared the values of each acoustic feature in STFT frames depending on the presence and type of NLP (see vignette analysis_any-NLP).…”

Section: NLP Annotation and Quantification (mentioning)

confidence: 99%

“…Sudden changes of f_o, known as frequency jumps or pitch jumps, have primarily been researched in the context of human singing [29–31], but they are also found in a variety of animal calls [13,32,33] and in human nonverbal vocalizations such as screams [21,34] and baby cries [35]. Their possible causes include both conditions intrinsic to the vocal folds [36,37] and source-filter interaction with the resonances of either the supralaryngeal vocal tract or the tracheal vocal tract ([38–41]; see Herbst & Elemans in this issue for more details).…”

Section: Frequency Jumps (mentioning)

confidence: 99%

How to analyze and manipulate nonlinear phenomena in voice recordings

Anikin, Herbst 2024. Preprint

We address two research applications in this methodological review: starting from an audio recording, the goal may be to characterize nonlinear phenomena (NLP) at the level of voice production or to test their perceptual effects on listeners. A crucial prerequisite for this work is the ability to detect NLP in acoustic signals, which can then be correlated with biologically relevant information about the caller and with listeners’ reaction. NLP are often annotated manually, but this is labor-intensive and not very reliable, although we describe potentially helpful advanced visualization aids such as reassigned spectrograms and phasegrams. Objective acoustic features can also be useful, including general descriptives (harmonics-to-noise ratio, cepstral peak prominence, vocal roughness), statistics derived from nonlinear dynamics (correlation dimension), and NLP-specific measures (depth of modulation and subharmonics). On the perception side, playback studies can greatly benefit from tools for directly manipulating NLP in recordings. Adding frequency jumps, amplitude modulation, and subharmonics is relatively straightforward. Creating biphonation, imitating chaos, or removing NLP from a recording is more challenging, but feasible with parametric voice synthesis. We describe the most promising algorithms for analyzing and manipulating NLP and provide detailed examples with audio files and R code in supplementary materials (https://osf.io/gs8u3/).
