Ragas are characterized by their melodic motifs or catch phrases, which constitute strong cues to the raga identity for both the performer and the listener, and are therefore of great interest in music retrieval and automatic transcription. While the characteristic phrases, or pakads, appear in written notation as a sequence of notes, the musicological rules for interpreting a phrase in performance, in a manner that allows considerable creative expression without transgressing raga grammar, are not explicitly defined. In this work, machine learning methods are used on labeled databases of Hindustani and Carnatic vocal audio concerts to obtain phrase classification on manually segmented audio. Dynamic time warping and HMM-based classification are applied to time series of detected pitch values used for the melodic representation of a phrase. Retrieval experiments on raga-characteristic phrases show promising results while providing interesting insights into the nature of variation in the surface realization of raga-characteristic motifs within and across concerts.
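The dynamic time warping step mentioned above can be illustrated with a minimal sketch. It assumes each phrase is a list of pitch values in cents and that classification is done by nearest template. The function names, the absolute-difference local cost and the nearest-template rule are illustrative assumptions, not the configuration used in the work.

```python
import math

def dtw_distance(a, b):
    """Cumulative DTW cost between two pitch contours (lists of cents values)."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def classify_phrase(query, templates):
    """Return the label of the template closest to the query under DTW.

    templates is a list of (label, contour) pairs (hypothetical structure).
    """
    return min(templates, key=lambda item: dtw_distance(query, item[1]))[0]
```

Because the warping path may repeat frames, a phrase sung more slowly than its template can still align at low cost, which is why DTW suits phrases whose duration varies between renditions.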
The tonic is a fundamental concept in Indian art music. It is the base pitch, which an artist chooses in order to construct the melodies during a rāg(a) rendition, and all accompanying instruments are tuned using the tonic pitch. Consequently, tonic identification is a fundamental task for most computational analyses of Indian art music, such as intonation analysis, melodic motif analysis and rāg recognition. In this paper we review existing approaches for tonic identification in Indian art music and evaluate them on six diverse datasets for a thorough comparison and analysis. We study the performance of each method in different contexts such as the presence/absence of additional metadata, the quality of audio data, the duration of audio data, music tradition (Hindustani/Carnatic) and the gender of the singer (male/female). We show that the approaches that combine multi-pitch analysis with machine learning provide the best performance in most cases (90% identification accuracy on average), and are robust across the aforementioned contexts compared to the approaches based on expert knowledge. In addition, we also show that the performance of the latter can be improved when additional metadata is available to further constrain the problem. Finally, we present a detailed error analysis of each method, providing further insights into the advantages and limitations of the methods.
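To see why the tonic is a prerequisite for the analyses listed above, consider tonic normalization: once the tonic is known, a pitch contour in Hz can be expressed in cents relative to it, which makes melodies comparable across artists. The sketch below uses the standard cents formula. The function names and the convention of marking unvoiced frames with 0 or None are assumptions for illustration, and the sketch does not implement any of the reviewed tonic identification methods.

```python
import math

def hz_to_cents(f_hz, tonic_hz):
    """Interval between a pitch and the tonic, in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_hz / tonic_hz)

def normalize_contour(pitch_hz, tonic_hz):
    """Convert a pitch contour from Hz to cents relative to the tonic.

    Unvoiced frames (assumed to be marked with 0 or None) are kept as None.
    """
    return [hz_to_cents(f, tonic_hz) if f else None for f in pitch_hz]
```

For example, with a tonic of 220 Hz, a pitch of 440 Hz maps to 1200 cents, one octave above the tonic.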
Sound event detection (SED) is the task of identifying the presence of specific sound events in a complex audio recording. SED has tremendous implications for video analytics, smart speaker algorithms and audio tagging. Recent advances in deep learning have yielded remarkable improvements in the performance of SED systems, albeit at the cost of extensive labeling efforts to train supervised methods using fully described sound class labels and timestamps. To address the limited availability of training data, this work proposes a self-training technique that leverages unlabeled datasets in supervised learning using pseudo-label estimation. The approach uses a dual-term objective function: a classification loss for the original labels and an expectation loss for the pseudo labels. The proposed self-training technique is applied to sound event detection in the context of the DCASE 2020 challenge, and yields a notable improvement over the baseline system for this task. The self-training approach is particularly effective in extending the labeled database with concurrent sound events.
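The dual-term objective can be sketched as follows. The exact form of the expectation loss is not given here, so this sketch assumes one plausible reading: the binary cross-entropy averaged over the estimated pseudo-label probability, which reduces to cross-entropy against a soft target. The names `bce`, `dual_term_loss` and the `weight` parameter are hypothetical.

```python
import math

EPS = 1e-7  # keeps log() away from zero

def bce(target, prob):
    """Binary cross-entropy; target may be hard (0 or 1) or soft (in [0, 1])."""
    p = min(max(prob, EPS), 1.0 - EPS)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def dual_term_loss(labeled, pseudo, weight=1.0):
    """Classification loss on labeled data plus weighted loss on pseudo labels.

    labeled: list of (y, p) pairs, y a ground-truth label in {0, 1}.
    pseudo:  list of (q, p) pairs, q an estimated pseudo-label probability.
    p is the model's predicted probability in both cases.
    """
    cls = sum(bce(y, p) for y, p in labeled) / len(labeled) if labeled else 0.0
    # Assumed expectation loss: E over y ~ Bernoulli(q) of bce(y, p) = bce(q, p).
    exp = sum(bce(q, p) for q, p in pseudo) / len(pseudo) if pseudo else 0.0
    return cls + weight * exp
```

Setting `weight` to 0 recovers purely supervised training, so the second term can be phased in as the pseudo-label estimates become more reliable.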
Neurophysiological studies of sound encoding at the level of auditory cortex paint a picture of an intricate filterbank that encodes detailed spectral and temporal modulations in the sensory input. Furthermore, these filters exhibit adaptive qualities called neural plasticity that shape their tuning parameters in line with behavioral goals of interest. In this work, we explore qualitative principles about how this neuronal reshaping can aid in an enhanced representation of target sounds. Here, we employ a set of parameterized two-dimensional Gabor filters as basis functions that tile the space of neurophysiological spectrotemporal modulations. We examine mechanisms for judiciously retuning parameters of the Gabor filter bank in order to enhance the representation of target sounds of interest. We test the efficacy of this scheme in enhancing representation of sound tokens in adverse noisy backgrounds.
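A parameterized two-dimensional Gabor filter of the kind described above can be sketched as a Gaussian envelope multiplied by a sinusoidal carrier over the time-frequency plane. The parameter names (`rate` for temporal modulation in Hz, `scale` for spectral modulation in cycles per octave, `sigma_t`, `sigma_f`) and the sampling grid are illustrative assumptions, not the parameterization used in the work.

```python
import math

def gabor_2d(rate, scale, sigma_t, sigma_f,
             half_t=10, half_f=10, dt=0.01, df=0.1):
    """Real-valued 2-D Gabor kernel sampled on a (frequency x time) grid.

    rate:  temporal modulation of the carrier, in Hz (t is in seconds).
    scale: spectral modulation of the carrier, in cycles/octave (f in octaves).
    sigma_t, sigma_f: widths of the Gaussian envelope along time and frequency.
    """
    kernel = []
    for i in range(-half_f, half_f + 1):
        f = i * df
        row = []
        for j in range(-half_t, half_t + 1):
            t = j * dt
            envelope = math.exp(-(t * t / (2 * sigma_t ** 2)
                                  + f * f / (2 * sigma_f ** 2)))
            carrier = math.cos(2 * math.pi * (rate * t + scale * f))
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel
```

In this picture, retuning a filter amounts to adjusting `rate`, `scale` and the envelope widths so that the bank responds more strongly to the modulations of the target sound.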
Parsing natural acoustic scenes using computational methodologies poses many challenges. Given the rich and complex nature of the acoustic environment, data mismatch between train and test conditions is a major hurdle in data-driven audio processing systems. In contrast, the brain exhibits a remarkable ability to segment acoustic scenes with relative ease. When tackling challenging listening conditions that are often faced in everyday life, the biological system relies on a number of principles that allow it to effortlessly parse its rich soundscape. In the current study, we leverage a key principle employed by the auditory system: its ability to adapt the neural representation of its sensory input in a high-dimensional space. We propose a framework that mimics this process in a computational model for robust speech activity detection. The system employs a 2-D Gabor filter bank whose parameters are retuned offline to improve the separability between the feature representations of speech and nonspeech sounds. This retuning process, driven by feedback from statistical models of speech and nonspeech classes, attempts to minimize the misclassification risk of mismatched data with respect to the original statistical models. We hypothesize that this risk minimization procedure results in an emphasis of unique speech and nonspeech modulations in the high-dimensional space. We show that such an adapted system is indeed robust to other novel conditions, with a marked reduction in equal error rates for a variety of databases with additive and convolutive noise distortions. We discuss the lessons learned from biology with regard to adapting to an ever-changing acoustic environment and the impact on building truly intelligent audio processing systems.
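The equal error rate used above as the evaluation measure is the operating point at which the miss rate and the false-alarm rate coincide. A minimal sketch of how it could be estimated from detector scores follows. The function name, the convention that higher scores indicate speech, and the threshold sweep over observed scores are assumptions for illustration.

```python
def equal_error_rate(speech_scores, nonspeech_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    A frame is classified as speech when its score is >= the threshold.
    Returns the mean of miss and false-alarm rates at the threshold where
    the two rates are closest to each other.
    """
    best = None
    for th in sorted(set(speech_scores) | set(nonspeech_scores)):
        miss = sum(s < th for s in speech_scores) / len(speech_scores)
        false_alarm = (sum(s >= th for s in nonspeech_scores)
                       / len(nonspeech_scores))
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            best = (gap, (miss + false_alarm) / 2)
    return best[1]
```

With finitely many scores the two rates may never match exactly, which is why the sketch reports their average at the closest point rather than interpolating.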