Using functional magnetic resonance imaging (fMRI), we found an area in the fusiform gyrus in 12 of the 15 subjects tested that was significantly more active when the subjects viewed faces than when they viewed assorted common objects. This face activation was used to define a specific region of interest individually for each subject, within which several new tests of face specificity were run. In each of five subjects tested, the predefined candidate "face area" also responded significantly more strongly to passive viewing of (1) intact than scrambled two-tone faces, (2) full front-view face photos than front-view photos of houses, and (in a different set of five subjects) (3) three-quarter-view face photos (with hair concealed) than photos of human hands; it also responded more strongly during (4) a consecutive matching task performed on three-quarter-view faces versus hands. Our technique of running multiple tests applied to the same region defined functionally within individual subjects provides a solution to two common problems in functional imaging: (1) the requirement to correct for multiple statistical comparisons and (2) the inevitable ambiguity in the interpretation of any study in which only two or three conditions are compared. Our data allow us to reject alternative accounts of the function of the fusiform face area (area "FF") that appeal to visual attention, subordinate-level classification, or general processing of any animate or human forms, demonstrating that this region is selectively involved in the perception of faces.
A core goal of auditory neuroscience is to build quantitative models that predict cortical responses to natural sounds. Reasoning that a complete model of auditory cortex must solve ecologically relevant tasks, we optimized hierarchical neural networks for speech and music recognition. The best-performing network contained separate music and speech pathways following early shared processing, potentially replicating human cortical organization. The network performed both tasks as well as humans and exhibited human-like errors despite not being optimized to do so, suggesting common constraints on network and human performance. The network predicted fMRI voxel responses substantially better than traditional spectrotemporal filter models throughout auditory cortex. It also provided a quantitative signature of cortical representational hierarchy: primary and non-primary responses were best predicted by intermediate and late network layers, respectively. The results suggest that task optimization provides a powerful set of tools for modeling sensory systems.
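The branched organization described above can be sketched in a few lines. This is an architecture illustration only, with assumed layer sizes and untrained random weights, not the paper's model:

```python
import numpy as np

# Sketch of a shared-then-branched network (assumed sizes, random weights):
# an early stage shared between tasks, followed by separate speech and
# music readouts, echoing the organization described in the abstract.
rng = np.random.default_rng(0)

x = rng.standard_normal(64)                # stand-in for cochleagram features
W_shared = rng.standard_normal((128, 64))  # shared early processing
W_speech = rng.standard_normal((10, 128))  # speech-recognition head
W_music = rng.standard_normal((10, 128))   # music-recognition head

h = np.maximum(0.0, W_shared @ x)          # shared ReLU stage
speech_out = W_speech @ h                  # task-specific pathways diverge here
music_out = W_music @ h
```

In the actual study the branch point itself was a hyperparameter; the finding was that branching after several shared layers performed best.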
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate the image regions that produce sounds and to separate the input audio into a set of components representing the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mix-and-Separate framework outperforms several baselines on source separation. Qualitative results suggest our model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources.
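The core of the Mix-and-Separate idea can be illustrated without any learning or video: mix two known sources, then recover one with a ratio mask over frequency, the kind of mask a network would be trained to predict from the mixture. A toy sketch with synthetic tones (all signals and parameters are illustrative assumptions):

```python
import numpy as np

# Toy mix-and-separate: two tones are mixed, and source 1 is recovered
# with an ideal ratio mask in the frequency domain. A trained model would
# predict such a mask from the mixture (plus visual features).
n = 1000
t = np.arange(n) / n
s1 = np.sin(2 * np.pi * 5 * t)    # "source 1": 5-cycle tone
s2 = np.sin(2 * np.pi * 40 * t)   # "source 2": 40-cycle tone

S1, S2 = np.fft.rfft(s1), np.fft.rfft(s2)
mixture = S1 + S2
mask1 = np.abs(S1) / (np.abs(S1) + np.abs(S2) + 1e-12)  # ideal ratio mask
recovered1 = np.fft.irfft(mask1 * mixture, n=n)          # approx. equals s1
```

Because the training mixtures are constructed from separate videos, the original sources serve as free supervision, which is why no manual labels are needed.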
Rainstorms, insect swarms, and galloping horses produce "sound textures"--the collective result of many similar acoustic events. Sound textures are distinguished by temporal homogeneity, suggesting they could be recognized with time-averaged statistics. To test this hypothesis, we processed real-world textures with an auditory model containing filters tuned for sound frequencies and their modulations, and measured statistics of the resulting decomposition. We then assessed the realism and recognizability of novel sounds synthesized to have matching statistics. Statistics of individual frequency channels, capturing spectral power and sparsity, generally failed to produce compelling synthetic textures; however, combining them with correlations between channels produced identifiable and natural-sounding textures. Synthesis quality declined if statistics were computed from biologically implausible auditory models. The results suggest that sound texture perception is mediated by relatively simple statistics of early auditory representations, presumably computed by downstream neural populations. The synthesis methodology offers a powerful tool for their further investigation.
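The statistics in question can be sketched concretely. The following is a deliberately crude stand-in for the auditory model (a rectangular FFT "filterbank" and rectified envelopes rather than cochlear and modulation filters); the statistic types (per-band moments capturing power and sparsity, plus cross-band correlations) follow the abstract:

```python
import numpy as np

def texture_statistics(sound, n_bands=8):
    """Toy sketch, not the paper's auditory model: split a signal into
    frequency bands, take coarse envelopes, and summarize each band with
    time-averaged statistics plus cross-band correlations."""
    spectrum = np.fft.rfft(sound)
    edges = np.linspace(0, len(spectrum), n_bands + 1, dtype=int)
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.zeros_like(spectrum)
        band[lo:hi] = spectrum[lo:hi]          # keep one frequency band
        subband = np.fft.irfft(band, n=len(sound))
        envelopes.append(np.abs(subband))      # crude envelope via rectification
    env = np.array(envelopes)
    mu = env.mean(axis=1, keepdims=True)
    return {
        "mean": env.mean(axis=1),              # spectral power per band
        "var": env.var(axis=1),
        "skew": ((env - mu) ** 3).mean(axis=1) / env.std(axis=1) ** 3,  # sparsity-related
        "corr": np.corrcoef(env),              # cross-band envelope correlations
    }

rng = np.random.default_rng(0)
stats = texture_statistics(rng.standard_normal(4096))
```

Synthesis then amounts to iteratively adjusting a noise signal until its measured statistics match those of a target texture.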
Music is present in every culture, but the degree to which it is shaped by biology remains debated. One widely discussed phenomenon is that some combinations of notes are perceived by Westerners as pleasant, or consonant, whereas others are perceived as unpleasant, or dissonant. The contrast between consonance and dissonance is central to Western music and its origins have fascinated scholars since the ancient Greeks. Aesthetic responses to consonance are commonly assumed by scientists to have biological roots, and thus to be universally present in humans. Ethnomusicologists and composers, in contrast, have argued that consonance is a creation of Western musical culture. The issue has remained unresolved, partly because little is known about the extent of cross-cultural variation in consonance preferences. Here we report experiments with the Tsimane'--a native Amazonian society with minimal exposure to Western culture--and comparison populations in Bolivia and the United States that varied in exposure to Western music. Participants rated the pleasantness of sounds. Despite exhibiting Western-like discrimination abilities and Western-like aesthetic responses to familiar sounds and acoustic roughness, the Tsimane' rated consonant and dissonant chords and vocal harmonies as equally pleasant. By contrast, Bolivian city- and town-dwellers exhibited significant preferences for consonance, albeit to a lesser degree than US residents. The results indicate that consonance preferences can be absent in cultures sufficiently isolated from Western music, and are thus unlikely to reflect innate biases or exposure to harmonic natural sounds. The observed variation in preferences is presumably determined by exposure to musical harmony, suggesting that culture has a dominant role in shaping aesthetic responses to music.
Some combinations of musical notes are consonant (pleasant), while others are dissonant (unpleasant), a distinction central to music. Explanations of consonance in terms of acoustics, auditory neuroscience, and enculturation have been debated for centuries [1-12]. We utilized individual differences to distinguish the candidate theories. We measured preferences for musical chords as well as nonmusical sounds that isolated particular acoustic factors – specifically, the beating and the harmonic relationships between frequency components, two factors that have long been thought to potentially underlie consonance [2, 3, 10, 13-20]. Listeners preferred stimuli without beats and with harmonic spectra, but across over 250 subjects, only the preference for harmonic spectra was consistently correlated with preferences for consonant over dissonant chords. Harmonicity preferences were also correlated with the number of years subjects had spent playing a musical instrument, suggesting that exposure to music amplifies preferences for harmonic frequencies because of their musical importance. Harmonic spectra are prominent features of natural sounds, and our results indicate they also underlie the perception of consonance.
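The two acoustic factors the study isolates are easy to illustrate with synthetic signals. The frequencies and durations below are arbitrary choices, not the study's stimuli:

```python
import numpy as np

# Beating arises when two nearby frequencies interfere, producing a slow
# amplitude fluctuation; a harmonic complex has components at integer
# multiples of a fundamental. Parameters here are illustrative only.
sr = 8000
t = np.arange(sr) / sr  # one second of samples

# Beating pair: 200 Hz + 206 Hz interfere, so the envelope fluctuates at 6 Hz
beating = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 206 * t)

# Harmonic complex: components at 200, 400, and 600 Hz (integer multiples)
harmonic = sum(np.sin(2 * np.pi * 200 * k * t) for k in (1, 2, 3))
```

Mistuning the upper components of the harmonic complex away from integer multiples, while controlling beating, is the kind of manipulation that lets the two factors be dissociated.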
Sensory signals are transduced at high resolution, but their structure must be stored in a more compact format. Here we provide evidence that the auditory system summarizes the temporal details of sounds using time-averaged statistics. We measured discrimination of 'sound textures' that were characterized by particular statistical properties, as normally result from the superposition of many acoustic features in auditory scenes. When listeners discriminated examples of different textures, performance improved with excerpt duration. In contrast, when listeners discriminated different examples of the same texture, performance declined with duration, a paradoxical result given that the information available for discrimination grows with duration. These results indicate that once these sounds are of moderate length, the brain's representation is limited to time-averaged statistics, which, for different examples of the same texture, converge to the same values with increasing duration. Such statistical representations produce good categorical discrimination, but limit the ability to discern temporal detail.
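The convergence argument at the heart of this result can be demonstrated with a minimal simulation. Here the "texture" is plain Gaussian noise and the "statistic" is just the time-averaged mean, a drastic simplification of the texture statistics themselves:

```python
import numpy as np

# For a stationary texture, a time-averaged statistic computed from two
# different excerpts converges to the same value as duration grows, so a
# purely statistical representation loses exemplar-specific detail.
rng = np.random.default_rng(1)

def mean_disparity(duration, n_trials=2000):
    """Average absolute difference between the time-averaged means of two
    independent excerpts of the given duration (in samples)."""
    a = rng.standard_normal((n_trials, duration)).mean(axis=1)
    b = rng.standard_normal((n_trials, duration)).mean(axis=1)
    return np.abs(a - b).mean()

short_disparity = mean_disparity(50)
long_disparity = mean_disparity(5000)  # smaller: statistics have converged
```

This is why discriminating different exemplars of the same texture paradoxically gets harder with longer excerpts, while discriminating different textures (whose statistics differ) gets easier.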
Probability distributions over external states (priors) are essential to the interpretation of sensory signals. Priors for cultural artifacts such as music and language remain largely uncharacterized, but likely constrain cultural transmission, because only those signals with high probability under the prior can be reliably reproduced and communicated. We developed a method to estimate priors for simple rhythms via iterated reproduction of random temporal sequences. Listeners were asked to reproduce random "seed" rhythms; their reproductions were fed back as the stimulus and over time became dominated by internal biases, such that the prior could be estimated by applying the procedure multiple times. We validated that the measured prior was consistent across the modality of reproduction and that it correctly predicted perceptual discrimination. We then measured listeners' priors over the entire space of two- and three-interval rhythms. Priors in US participants showed peaks at rhythms with simple integer ratios and were similar for musicians and non-musicians. An analogous procedure produced qualitatively different results for spoken phrases, indicating some specificity to music. Priors measured in members of a native Amazonian society were distinct from those in US participants but also featured integer ratio peaks. The results do not preclude biological constraints favoring integer ratios, but they suggest that priors on musical rhythm are substantially modulated by experience and may simply reflect the empirical distribution of rhythm that listeners encounter. The proposed method can efficiently map out a high-resolution view of biases that shape transmission and stability of simple reproducible patterns within a culture.
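The logic of iterated reproduction can be sketched with a toy simulation. The "prior" below (attractors at simple integer ratios) and the response model (a weighted pull toward the nearest attractor plus noise) are assumptions for illustration, not the paper's experimental procedure:

```python
import numpy as np

# Toy serial reproduction: each "reproduction" pulls the stimulus toward
# the nearest mode of an assumed internal prior, plus motor noise.
# Iterating drives the chain toward the prior's modes, which is what lets
# the procedure estimate them from reproductions alone.
rng = np.random.default_rng(2)
prior_modes = np.array([1 / 3, 1 / 2, 2 / 3])  # assumed simple-ratio attractors

def reproduce(ratio, weight=0.3, noise=0.01):
    nearest = prior_modes[np.argmin(np.abs(prior_modes - ratio))]
    return (1 - weight) * ratio + weight * nearest + rng.normal(0.0, noise)

chain = [0.45]  # random "seed" interval ratio
for _ in range(20):
    chain.append(reproduce(chain[-1]))
# chain drifts toward the nearest prior mode (here, 1/2)
```

Running many such chains from random seeds and histogramming the endpoints yields an estimate of the prior, which is the essence of the method described above.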