Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications

Can we perform an end-to-end music source separation with a variable number of sources using a deep learning model? This paper presents an extension of the Wave-U-Net [1] model which allows end-to-end monaural source separation with a non-fixed number of sources. Furthermore, we propose multiplicative conditioning with instrument labels at the bottleneck of the Wave-U-Net and show its effect on the separation results. This approach can be further extended to other types of conditioning such as audio-visual source separation and score-informed source separation.
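
As a rough illustration of the multiplicative conditioning mentioned in the abstract above, the sketch below scales each bottleneck channel by a gate computed from a multi-hot instrument label vector. The sigmoid gate, the linear projection, and all names (`multiplicative_conditioning`, `weights`, `bias`) are assumptions made for illustration; the abstract does not specify how the labels are mapped onto the bottleneck features.

```python
import math


def sigmoid(x):
    """Logistic function, used here as a hypothetical gate nonlinearity."""
    return 1.0 / (1.0 + math.exp(-x))


def multiplicative_conditioning(features, labels, weights, bias):
    """Scale each bottleneck channel by a gate derived from instrument labels.

    features: list of C channels, each a list of T floats (bottleneck activations)
    labels:   multi-hot list of K floats (1.0 where an instrument is present)
    weights:  C x K list of lists (assumed learned projection, hypothetical)
    bias:     list of C floats (hypothetical)
    """
    gated = []
    for c, channel in enumerate(features):
        # Project the label vector to one scalar per channel.
        pre = bias[c] + sum(w * l for w, l in zip(weights[c], labels))
        gate = sigmoid(pre)
        # Multiplicative conditioning: every time step of the channel is scaled.
        gated.append([gate * v for v in channel])
    return gated


# Toy example: 2 channels, 2 time steps, 3 instrument classes.
features = [[1.0, 2.0], [3.0, 4.0]]
labels = [1.0, 0.0, 1.0]
weights = [[0.0, 0.0, 0.0], [10.0, 0.0, 10.0]]
bias = [0.0, 0.0]
out = multiplicative_conditioning(features, labels, weights, bias)
```

In this toy setup the first channel has zero weights, so its gate is exactly 0.5 and the channel is halved, while the second channel's gate is close to 1 and the channel passes through almost unchanged.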

“…2. Results in terms of SDR, SIR, and SAR averaged and reported by the number of instruments in the testing set of URMP [17] dataset.…”

Section: Results (mentioning)

End-to-end Sound Source Separation Conditioned on Instrument Labels

Slizovskaia, Kim, Haro. 2019. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

“…Therefore, we compose two novel datasets to train and evaluate our models, and they are a Subset of URMP (Sub-URMP) dataset and a ImageNet Image-Sound (INIS) dataset. Sub-URMP dataset is composed from the original URMP dataset [11]. It contains 13 music instrument categories.…”

Section: Datasets (mentioning)

“…To explore this new problem space, we compose two datasets, e.g., Sub-URMP and INIS. The Sub-URMP dataset consists of paired images and sounds extracted from 107 single-instrument musical performance videos of 13 kinds of instruments in the University of Rochester Musical Performance (URMP) dataset [11]. In total 17,555 images are extracted and each image is paired with a half-second long sound clip.…”

Section: Introduction (mentioning)

Deep Cross-Modal Audio-Visual Generation

Chen, Srivastava, Duan, et al. 2017. Proceedings of the on Thematic Workshops of ACM Multimedia 2017

Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite works in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluations demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.

“…There are some similar works that generate images condition on sounds, such as [19] [20]. In these works, they use different dataset called Sub-URMP [19] [21] which is composed of sounds of musical performances with monotonous background and similar composition in images. By using different training scenario, they achieve the goal of generating images which depict a single person with an instrument correspond to input sound.…”

Section: Related Work (mentioning)

Towards Audio to Scene Image Synthesis Using Generative Adversarial Network

Wan, Chuang, Lee. 2019. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Humans can imagine a scene from a sound. We want machines to do so by using conditional generative adversarial networks (GANs). By applying techniques including spectral norm, projection discriminator, and auxiliary classifier, the model can generate images of better quality than a naive conditional GAN in terms of both subjective and objective evaluations. Almost three-fourths of people agree that our model has the ability to generate images related to sounds. When given the same sound at different volumes, our model outputs different scales of changes based on the volumes, showing that our model truly knows the relationship between sounds and images to some extent.

Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications

Cited by 124 publications

References 43 publications
