The INTERSPEECH 2016 Computational Paralinguistics Challenge addresses three different problems for the first time in a research competition under well-defined conditions: classification of deceptive versus non-deceptive speech, estimation of the degree of sincerity, and identification of the native language from among eleven L1 classes of English L2 speakers. In this paper, we describe these sub-challenges, their conditions, the baseline feature extraction and classifiers, and the resulting baselines, as provided to the participants.
We contend that the structure of affect elicited by music depends largely on dynamic temporal patterns in low-level music structural parameters. In support of this claim, we have previously provided evidence that spatiotemporal dynamics in psychoacoustic features resonate with two psychological dimensions of affect underlying judgments of subjective feelings: arousal and valence. In this article we extend our previous investigations in two respects. First, we focus on the emotions experienced, rather than perceived, while listening to music. Second, we evaluate the extent to which peripheral feedback can account for the predicted emotional responses, that is, the role of physiological arousal in determining the intensity and valence of musical emotions. In line with our previous findings, we show that a significant part of listeners' reported emotions can be predicted from a set of six psychoacoustic features: loudness, pitch level, pitch contour, tempo, texture, and sharpness. Furthermore, the accuracy of those predictions improves when physiological cues (skin conductance and heart rate) are included. The interdisciplinary work presented here offers the field of music and emotion research a new methodology based on the combination of computational and experimental work; it aids the analysis of emotional responses to music while providing a platform for the abstract representation of those complex relationships. Future developments may benefit specific areas, such as psychology and music therapy, by providing coherent descriptions of the emotional effects of specific music stimuli.
There is strong evidence of shared acoustic profiles common to the expression of emotions in music and speech, yet relatively limited understanding of the specific psychoacoustic features involved. This study combined a controlled experiment and computational modelling to investigate the perceptual codes associated with the expression of emotion in the acoustic domain. The empirical stage of the study provided continuous human ratings of emotions perceived in excerpts of film music and natural speech samples. The computational stage created a computer model that retrieves the relevant information from the acoustic stimuli and predicts the emotional expressiveness of speech and music in close agreement with the responses of human subjects. We show that a significant part of listeners' second-by-second reported emotions to music and speech prosody can be predicted from a set of seven psychoacoustic features: loudness, tempo/speech rate, melody/prosody contour, spectral centroid, spectral flux, sharpness, and roughness. The implications of these results are discussed in the context of cross-modal similarities in the communication of emotion in the acoustic domain.
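As a rough illustration of the modelling pipeline described in the two preceding abstracts, the hedged sketch below extracts a few of the named psychoacoustic features with librosa (RMS energy as a loudness proxy, spectral centroid, and onset strength as a spectral-flux proxy; sharpness and roughness have no off-the-shelf librosa equivalent and are omitted) and fits a ridge regression to time-aligned emotion ratings. The file names, the choice of regressor, and the feature proxies are illustrative assumptions, not the authors' implementation; physiological cues such as skin conductance could simply be concatenated as extra feature columns.

```python
# Hedged sketch: predict continuous emotion ratings from a subset of
# the psychoacoustic features named in the abstracts. librosa and
# scikit-learn are assumed; the papers' exact features, model, and
# data are not reproduced here.
import numpy as np
import librosa
from sklearn.linear_model import Ridge

def psychoacoustic_features(path, sr=22050, hop=512):
    y, sr = librosa.load(path, sr=sr)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]                 # loudness proxy
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr,
                                                 hop_length=hop)[0]   # brightness
    flux = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)   # spectral-flux proxy
    n = min(len(rms), len(centroid), len(flux))
    return np.stack([rms[:n], centroid[:n], flux[:n]], axis=1)

# X: per-frame features; ratings: time-aligned human arousal ratings.
X = psychoacoustic_features("excerpt.wav")          # hypothetical audio file
ratings = np.load("arousal_ratings.npy")[: len(X)]  # hypothetical ratings
model = Ridge(alpha=1.0).fit(X, ratings)
predicted_arousal = model.predict(X)
```

A linear model is used here only for brevity; the studies above report second-by-second predictions, which any regressor over frame-level features can produce in the same way.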
This article presents a novel methodology to analyze the dynamics of emotional responses to music. It consists of a computational investigation based on spatiotemporal neural networks, which "mimic" human affective responses to music and predict the responses to novel music sequences. The results provide evidence suggesting that spatiotemporal patterns of sound resonate with affective features underlying judgments of subjective feelings (arousal and valence). A significant part of the listener's affective response is predicted from a set of six psychoacoustic features of sound: loudness, tempo, texture, mean pitch, pitch variation, and sharpness. A detailed analysis of the network parameters and dynamics also allows us to identify the role of specific psychoacoustic variables (e.g., tempo and loudness) in the emotional appraisal of music. This work contributes new evidence and insights to the study of musical emotions, with particular relevance to the music perception and cognition research community.
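The abstract does not specify the spatiotemporal network's architecture; the sketch below shows one plausible reading under stated assumptions: a small recurrent model in PyTorch mapping a sequence of six psychoacoustic feature vectors to per-frame arousal/valence. The LSTM choice, layer sizes, and synthetic tensors are all assumptions for illustration, not the authors' design.

```python
# Hedged sketch: a recurrent model mapping psychoacoustic feature
# sequences to arousal/valence trajectories. One plausible shape for
# a "spatiotemporal" affect model, not the authors' architecture.
import torch
import torch.nn as nn

class AffectRNN(nn.Module):
    def __init__(self, n_features=6, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # per-frame (arousal, valence)

    def forward(self, x):                  # x: (batch, time, n_features)
        out, _ = self.rnn(x)
        return self.head(out)              # (batch, time, 2)

model = AffectRNN()
features = torch.randn(8, 300, 6)          # 8 excerpts, 300 frames, 6 features
ratings = torch.randn(8, 300, 2)           # time-aligned affect targets
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):                        # toy training loop
    optim.zero_grad()
    loss = nn.functional.mse_loss(model(features), ratings)
    loss.backward()
    optim.step()
```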
Scarcity of labeled data is a common problem in sound classification tasks. Sound classifiers are commonly trained with supervised learning algorithms, which require labeled data that is often scarce, yielding models that do not generalize well. In this paper, we efficiently combine confidence-based Active Learning and Self-Training with the aim of minimizing the human annotation needed to train a sound classification model. The proposed method computes a classifier confidence score for each instance awaiting labeling, delivers the candidates with low scores to human annotators, and automatically labels those with high scores. We demonstrate the feasibility and efficacy of this method in two practical scenarios: pool-based and stream-based processing. Extensive experimental results indicate that, in both scenarios, our approach requires significantly fewer labeled instances to reach the same performance as Passive Learning, Active Learning, and Self-Training. On a sound classification task with 16,930 sound instances, human-labeled instances are reduced by 52.2% in both the pool-based and stream-based scenarios.
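A minimal pool-based sketch of the routing rule this abstract describes, assuming a scikit-learn classifier whose predict_proba output serves as the confidence score: low-confidence instances go to a human oracle, high-confidence instances are self-labeled, and the rest are deferred. The threshold values, the oracle interface, and the choice of logistic regression are illustrative assumptions, not the paper's method.

```python
# Hedged sketch: confidence-based Active Learning + Self-Training,
# pool-based scenario. Thresholds and the oracle are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_self_training(X_seed, y_seed, X_pool, oracle,
                         low=0.5, high=0.9, rounds=10):
    X, y = X_seed.copy(), y_seed.copy()
    pool = list(range(len(X_pool)))        # indices of unlabeled instances
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        if not pool:
            break
        clf.fit(X, y)
        conf = clf.predict_proba(X_pool[pool]).max(axis=1)
        keep = []
        for i, c in zip(pool, conf):
            if c < low:                    # low confidence -> ask a human
                X = np.vstack([X, X_pool[i:i + 1]])
                y = np.append(y, oracle(i))
            elif c >= high:                # high confidence -> self-label
                X = np.vstack([X, X_pool[i:i + 1]])
                y = np.append(y, clf.predict(X_pool[i:i + 1])[0])
            else:
                keep.append(i)             # defer ambiguous instances
        pool = keep
    return clf.fit(X, y)
```

The stream-based scenario would apply the same two-threshold rule to instances as they arrive rather than sweeping a fixed pool.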
In this work, we compared emotions induced by the same performance of Schubert Lieder during a live concert and in a laboratory viewing/listening setting to determine the extent to which laboratory research on affective reactions to music approximates real listening conditions in dedicated performances. We measured emotions experienced by volunteer members of an audience that attended a Lieder recital in a church (Context 1) and emotional reactions to an audio-video-recording of the same performance in a university lecture hall (Context 2). Three groups of participants were exposed to three presentation versions in Context 2: (1) an audio-visual recording, (2) an audio-only recording, and (3) a video-only recording. Participants achieved statistically higher levels of emotional convergence in the live performance than in the laboratory context, and the experience of particular emotions was determined by complex interactions between auditory and visual cues in the performance. This study demonstrates the contribution of the performance setting and the performers’ appearance and nonverbal expression to emotion induction by music, encouraging further systematic research into the factors involved.