Estimation of dominant melody in polyphonic music remains a difficult task, even though promising breakthroughs have been done recently with the introduction of the Harmonic CQT and the use of fully convolutional networks. In this paper, we build upon this idea and describe how U-Net -a neural network originally designed for medical image segmentationcan be used to estimate the dominant melody in polyphonic audio. We propose in particular the use of an original layerby-layer sequential training method, and show that this method used along with careful training data conditioning improve the results compared to plain convolutional networks.
Automatic cover detection -the task of finding in a audio dataset all covers of a query track -has long been a challenging theoretical problem in MIR community. It also became a practical need for music composers societies requiring to detect automatically if an audio excerpt embeds musical content belonging to their catalog.In a recent work, we addressed this problem with a convolutional neural network mapping each track's dominant melody to an embedding vector, and trained to minimize cover pairs distance in the embeddings space, while maximizing it for non-covers. We showed in particular that training this model with enough works having five or more covers yields state-of-the-art results.This however does not reflect the realistic use case, where music catalogs typically contain works with zero or at most one or two covers. We thus introduce here a new test set incorporating these constraints, and propose two contributions to improve our model's accuracy under these stricter conditions: we replace dominant melody with multi-pitch representation as input data, and describe a novel prototypical triplet loss designed to improve covers clustering. We show that these changes improve results significantly for two concrete use cases, large dataset lookup and live songs identification.
Estimation of dominant melody in polyphonic music remains a difficult task, even though promising breakthroughs have been done recently with the introduction of the Harmonic CQT and the use of fully convolutional networks. In this paper, we build upon this idea and describe how U-Net-a neural network originally designed for medical image segmentationcan be used to estimate the dominant melody in polyphonic audio. We propose in particular the use of an original layerby-layer sequential training method, and show that this method used along with careful training data conditioning improve the results compared to plain convolutional networks.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.