Cross-Modal Music Retrieval and Applications: An Overview of Key Methodologies

Müller, Meinard; Arzt, Andreas; Balke, Stefan; Dorfer, Matthias; Widmer, Gerhard

doi:10.1109/msp.2018.2868887

Cited by 51 publications

(31 citation statements)

References 21 publications

(42 reference statements)

“…Because data of different modalities can be treated as identical data in a joint-embedding space and trained under a common metric, deep metric learning and joint-embedding techniques perform well together. In MIR-related tasks, deep metric learning succeeds in learning joint representations over several modalities such as a vocal and mix [23], vocal imitation and sound recording [24], [25], animal sounds [26], sheet music and audio spectrograms [27], music and image [28]-[31], and music and video [21], [22]. The target pair for the metric learning described in this paper consists of a vocal track and an accompaniment track.…”

Section: B Self-supervised and Joint-embedding Techniques (mentioning)

confidence: 99%

Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques

et al. 2021

We propose a learning-based method of estimating the compatibility between vocal and accompaniment audio tracks, i.e., how well they go with each other when played simultaneously. This task is challenging because it is difficult to formulate hand-crafted rules or construct a large labeled dataset to perform supervised learning. Our method uses self-supervised and joint-embedding techniques for estimating vocal-accompaniment compatibility. We train vocal and accompaniment encoders to learn a joint-embedding space of vocal and accompaniment tracks, where the embedded feature vectors of a compatible pair of vocal and accompaniment tracks lie close to each other and those of an incompatible pair lie far from each other. To address the lack of large labeled datasets consisting of compatible and incompatible pairs of vocal and accompaniment tracks, we propose generating such a dataset from songs using singing voice separation techniques, with which songs are separated into pairs of vocal and accompaniment tracks, and then original pairs are assumed to be compatible, and other random pairs are not. We achieved this training by constructing a large dataset containing 910,803 songs and evaluated the effectiveness of our method using ranking-based evaluation methods.

INDEX TERMS Vocal-accompaniment compatibility, metric learning, music signal processing, music information retrieval.
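The training objective described in this abstract (compatible pairs close, incompatible pairs far apart) can be illustrated with a small sketch. The abstract does not name a specific loss, so the hinge-style triplet loss, the margin value, the L2 normalization, and the toy three-dimensional vectors below are all assumptions for illustration; the encoders themselves are not modeled.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on L2-normalized embeddings (assumed loss form)."""
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)

def rank_candidates(query, candidates):
    """Return candidate indices ordered from closest to farthest from the query."""
    q = l2_normalize(query)
    dists = [squared_distance(q, l2_normalize(c)) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: dists[i])

# Hypothetical embeddings: a vocal track, the accompaniment separated from the
# same song (assumed compatible), and an accompaniment from a random other song.
vocal = [0.9, 0.1, 0.0]
own_accomp = [0.8, 0.2, 0.1]
other_accomp = [0.0, 0.1, 0.9]

loss_ok = triplet_loss(vocal, own_accomp, other_accomp)   # pair already well separated
loss_bad = triplet_loss(vocal, other_accomp, own_accomp)  # roles swapped, so penalized
ranking = rank_candidates(vocal, [other_accomp, own_accomp])
```

Ranking accompaniment candidates by distance to a vocal embedding, as `rank_candidates` does, is one simple way to set up the kind of ranking-based evaluation the abstract mentions.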

“…The Erkomaishvili dataset can be used to address a wide range of research questions including technical as well as musicological ones. For example, a cappella vocal music is a challenging scenario for various MIR tasks such as F0 estimation (Salamon et al., 2014), onset detection (Böck et al., 2012), and score-to-audio alignment (Thomas et al., 2012; Arzt, 2016; Müller et al., 2019). In particular, the not equal-tempered nature of the Georgian songs and the characteristic pitch slides in traditional Georgian singing constitute challenging test scenarios for MIR algorithms.…”

Section: Applications for MIR and Musicology (mentioning)

confidence: 99%

Erkomaishvili Dataset: A Curated Corpus of Traditional Georgian Vocal Music for Computational Musicology

Rosenzweig, Scherbaum, Shugliashvili, et al. 2020

Transactions of the International Society for Music Information Retrieval

The analysis of recorded audio material using computational methods has received increased attention in ethnomusicological research. We present a curated dataset of traditional Georgian vocal music for computational musicology. The corpus is based on historic tape recordings of three-voice Georgian songs performed by the former master chanter Artem Erkomaishvili. In this article, we give a detailed overview of the audio material, transcriptions, and annotations contained in the dataset. Beyond its importance for ethnomusicological research, this carefully organized and annotated corpus constitutes a challenging scenario for music information retrieval tasks such as fundamental frequency estimation, onset detection, and score-to-audio alignment. The corpus is publicly available and accessible through score-following web-players.

“…Content-based systems can be further categorized according to the modalities involved. For an overview of multi-modal music retrieval scenarios, we refer to a survey by Müller et al. [34]. In our contribution, we focus on retrieval scenarios, where both query and database documents are audio recordings.…”

Section: Related Work (mentioning)

confidence: 99%

Learning Low-Dimensional Embeddings of Audio Shingles for Cross-Version Retrieval of Classical Music

Zalkow

Müller

2019

Applied Sciences

Self Cite

Cross-version music retrieval aims at identifying all versions of a given piece of music using a short query audio fragment. One previous approach, which is particularly suited for Western classical music, is based on a nearest neighbor search using short sequences of chroma features, also referred to as audio shingles. From the viewpoint of efficiency, indexing and dimensionality reduction are important aspects. In this paper, we extend previous work by adapting two embedding techniques; one is based on classical principal component analysis, and the other is based on neural networks with triplet loss. Furthermore, we report on systematically conducted experiments with Western classical music recordings and discuss the trade-off between retrieval quality and embedding dimensionality. As one main result, we show that, using neural networks, one can reduce the audio shingles from 240 to fewer than 8 dimensions with only a moderate loss in retrieval accuracy. In addition, we present extended experiments with databases of different sizes and different query lengths to test the scalability and generalizability of the dimensionality reduction methods. We also provide a more detailed view into the retrieval problem by analyzing the distances that appear in the nearest neighbor search.
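The shingle-based nearest neighbor search described in this abstract can be sketched roughly as follows. The abstract gives only the total dimensionality (240); the layout of 20 consecutive frames with 12 chroma bins each, the squared Euclidean distance, and the toy one-hot chroma sequence are assumptions for illustration. The PCA and neural embedding steps are not shown.

```python
def make_shingles(chroma, length=20):
    """Stack `length` consecutive chroma frames into one flat vector per start frame."""
    return [
        [value for frame in chroma[start:start + length] for value in frame]
        for start in range(len(chroma) - length + 1)
    ]

def nearest_shingle(query, database):
    """Return (index, squared Euclidean distance) of the closest database shingle."""
    best_index, best_dist = -1, float("inf")
    for index, item in enumerate(database):
        dist = sum((q - x) ** 2 for q, x in zip(query, item))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist

# Toy chroma sequence (hypothetical data): 30 frames, each a one-hot vector
# that cycles through the 12 pitch classes.
chroma = [[1.0 if pc == t % 12 else 0.0 for pc in range(12)] for t in range(30)]

database = make_shingles(chroma)   # 30 - 20 + 1 shingles of 20 * 12 values each
query = database[3]                # reuse a database shingle as the query
index, dist = nearest_shingle(query, database)
```

A linear scan like this costs time proportional to database size times shingle dimensionality, which is the kind of cost that motivates the indexing and dimensionality reduction the abstract discusses.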
