Abstract: Music source separation is the task of decomposing music into its constitutive components, e.g., yielding separated stems for the vocals, bass, and drums. Such a separation has many applications, ranging from rearranging/repurposing the stems (remixing, repanning, upmixing) to full extraction (karaoke, sample creation, audio restoration). Music separation has a long history of scientific activity, as it is known to be a very challenging problem. In recent years, deep learning-based systems, for the first time, yielded…
“…The paper proposes a new SID model extending from CRNN and involving the use of melody information by leveraging CREPE [6]. Also, a data augmentation method called shuffle-and-remix is adopted to avoid the confounds from the accompaniments by using source separation [12]. Our evaluation shows that both melody information and data augmentation improve the result, especially the latter.…”
Section: Discussion (mentioning, confidence: 96%)
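As a concrete illustration of the melody-information step quoted above, the sketch below extracts a vocal melody contour with the CREPE pip package. The input file name and the confidence threshold used to mask unvoiced frames are our own assumptions, not details from the paper.

# Sketch: vocal melody contour extraction with CREPE.
import crepe
import librosa
import numpy as np

# CREPE operates on 16 kHz mono audio; "vocals.wav" is a placeholder.
audio, sr = librosa.load("vocals.wav", sr=16000, mono=True)

# Viterbi decoding smooths frame-wise pitch estimates into a contour.
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

# Heuristic (our assumption): zero out low-confidence, likely unvoiced frames.
frequency = np.where(confidence > 0.5, frequency, 0.0)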
“…In contrast, in our work both the SS model and the SID model employ deep learning. Specifically, we use open-unmix [12], an open-source three-layer bidirectional deep recurrent neural network for SS. Moreover, we build our SID model upon the implementation of a convolutional recurrent neural network made available by Nasrullah and Zhao [17], which attains the highest song-level F1-score of 0.67 on the per-album split of the artist20 dataset [18], a standard dataset for SID.…”
Section: Conv Block (mentioning, confidence: 99%)
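For reference, running open-unmix as the separation step can look like the following sketch via torch.hub. The model name "umxhq" and the tensor layout follow the sigsep/open-unmix-pytorch repository; treat both as assumptions rather than the quoted paper's exact setup.

# Sketch: separating a mixture into stems with Open-Unmix.
import torch
import torchaudio

# Open-Unmix separators are trained on 44.1 kHz stereo; "mixture.wav" is a placeholder.
audio, sr = torchaudio.load("mixture.wav")              # (channels, samples)
separator = torch.hub.load("sigsep/open-unmix-pytorch", "umxhq")
separator.eval()

with torch.no_grad():
    # The separator expects a batch dimension and returns one stem per target.
    estimates = separator(audio.unsqueeze(0))           # (batch, targets, channels, samples)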
“…For simplicity, we use the same design for the mel-spectrogram branch and the melody contour branch. Second, instead of using the mel-spectrogram of the mixture audio recordings, we employ open-unmix [12] to remove the instrumental part of the music, and use the proposed data augmentation technique to increase the size of the training data, as described below.…”
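The two-branch design this excerpt describes can be sketched as below: an identical conv block is reused for the mel-spectrogram and melody-contour inputs, as the quote states, but every size here (channel counts, GRU width, number of singers) is an illustrative assumption, not the paper's configuration.

# Sketch: two identical conv branches feeding a recurrent classifier.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Same design reused for both branches, per the excerpt.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class TwoBranchCRNN(nn.Module):
    def __init__(self, n_singers=20):
        super().__init__()
        self.mel_branch = nn.Sequential(conv_block(1, 32), conv_block(32, 64))
        self.contour_branch = nn.Sequential(conv_block(1, 32), conv_block(32, 64))
        self.gru = nn.GRU(input_size=128, hidden_size=64, batch_first=True)
        self.out = nn.Linear(64, n_singers)

    def forward(self, mel, contour):
        # mel, contour: (batch, 1, freq, time) with matching shapes.
        m = self.mel_branch(mel).mean(dim=2)              # pool freq -> (batch, 64, time')
        c = self.contour_branch(contour).mean(dim=2)
        x = torch.cat([m, c], dim=1).transpose(1, 2)      # (batch, time', 128)
        _, h = self.gru(x)
        return self.out(h[-1])                            # (batch, n_singers)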
Identifying singers is an important task with many applications. However, the task remains challenging due to several issues. One major issue concerns the confounding factors from the background instrumental music that is mixed with the vocals in music production. A singer identification model may learn to extract non-vocal features from the instrumental part of the songs if a singer only sings in certain musical contexts (e.g., genres). The model therefore cannot generalize well when the singer sings in unseen contexts. In this paper, we attempt to address this issue. Specifically, we employ open-unmix, an open-source tool with state-of-the-art performance in source separation, to separate the vocal and instrumental tracks of music. We then investigate two means to train a singer identification model: learning from the separated vocals only, or from an augmented set of data where we "shuffle-and-remix" the separated vocal tracks and instrumental tracks of different songs to artificially make the singers sing in different contexts. We also incorporate melodic features learned from the vocal melody contour for better performance. Evaluation results on a benchmark dataset called artist20 show that this data augmentation method greatly improves the accuracy of singer identification.
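A minimal sketch of the shuffle-and-remix idea, assuming the vocal and instrumental tracks have already been separated (e.g., by open-unmix) and loaded as NumPy arrays. The truncation-based length alignment and the unit remix gain are our simplifications, not the paper's recipe.

# Sketch: remix each vocal with an accompaniment from a different song.
import random
import numpy as np

def shuffle_and_remix(vocals: list, accompaniments: list, seed: int = 0) -> list:
    """Pair each separated vocal with a randomly chosen accompaniment
    from another song, placing singers in unfamiliar musical contexts."""
    rng = random.Random(seed)
    remixed = []
    for i, voc in enumerate(vocals):
        j = rng.choice([k for k in range(len(accompaniments)) if k != i])
        acc = accompaniments[j]
        n = min(len(voc), len(acc))          # align lengths by truncation
        remixed.append(voc[:n] + acc[:n])    # simple additive remix
    return remixed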
“…A major design choice in music source separation models is whether (1) to train a separate model for each instrument [12], (2) to use a single class-conditional model, or (3) to use an instrument-agnostic approach [16]. Our approach aims to combine the advantages of the first two: the high precision of independent models with the improved optimization via parameter sharing of single models.…”
We propose a hierarchical meta-learning-inspired model for music source separation (Meta-TasNet) in which a generator model is used to predict the weights of individual extractor models. This enables efficient parameter sharing while still allowing for instrument-specific parameterization. Meta-TasNet is shown to be more effective than models trained independently or in a multi-task setting, and achieves performance comparable to state-of-the-art methods. In comparison to the latter, our extractors contain fewer parameters and have faster run-time performance. We discuss important architectural considerations, and explore the costs and benefits of this approach.
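The generator/extractor split can be illustrated with a toy hypernetwork in PyTorch: a shared generator maps an instrument embedding to the weights of a small per-instrument linear extractor, so parameter sharing lives in the generator while specialization lives in the generated weights. This is a sketch of the idea only; Meta-TasNet generates weights for full TasNet-style extractors, and all sizes below are assumptions.

# Sketch: a generator network predicting per-instrument extractor weights.
import torch
import torch.nn as nn

class MetaExtractor(nn.Module):
    def __init__(self, n_instruments=4, emb_dim=32, feat_dim=64):
        super().__init__()
        self.instrument_emb = nn.Embedding(n_instruments, emb_dim)
        # Generator predicts a (feat_dim x feat_dim) weight matrix plus a bias
        # for one linear extractor layer, conditioned on the instrument.
        self.generator = nn.Linear(emb_dim, feat_dim * feat_dim + feat_dim)
        self.feat_dim = feat_dim

    def forward(self, features, instrument_id):
        # features: (batch, time, feat_dim); instrument_id: (batch,)
        params = self.generator(self.instrument_emb(instrument_id))
        w, b = params[:, :-self.feat_dim], params[:, -self.feat_dim:]
        w = w.view(-1, self.feat_dim, self.feat_dim)
        # Apply the generated extractor: bias + features @ w^T per sample.
        return torch.baddbmm(b.unsqueeze(1), features, w.transpose(1, 2))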