Learning Frame Similarity using Siamese networks for Audio-to-Score Alignment

Agrawal, Ruchit; Dixon, Simon

doi:10.23919/eusipco47968.2020.9287625

Cited by 9 publications

(7 citation statements)

References 17 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…However, their method struggles with pieces containing both backward and forward jumps, which is an important challenge we tackle using our progressively dilated convolutional models. Recent work on audio-to-score alignment has demonstrated the efficacy of multimodal embeddings [15], reinforcement learning [16], [17] and learnt frame similarities [18], albeit these are not structure-aware methods. Very recently, Shan et al propose Hierarchical-DTW [19] to automatically generate piano score following videos given an audio and a raw image of sheet music.…”

Section: Related Workmentioning

confidence: 99%

Structure-Aware Audio-to-Score Alignment Using Progressively Dilated Convolutional Neural Networks

Agrawal

Wolff

Dixon

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

View full text Add to dashboard Cite

The identification of structural differences between a music performance and the score is a challenging yet integral step of audioto-score alignment, an important subtask of music information retrieval. We present a novel method to detect such differences between the score and performance for a given piece of music using progressively dilated convolutional neural networks. Our method incorporates varying dilation rates at different layers to capture both short-term and long-term context, and can be employed successfully in the presence of limited annotated data. We conduct experiments on audio recordings of real performances that differ structurally from the score, and our results demonstrate that our models outperform standard methods for structure-aware audio-to-score alignment.

show abstract

Section: Related Workmentioning

confidence: 99%

Structure-Aware Audio-to-Score Alignment Using Progressively Dilated Convolutional Neural Networks

Agrawal

Wolff

Dixon

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

View full text Add to dashboard Cite

show abstract

“…IRs allow us to model different recording conditions in the form of microphones and room characteristics. We use more than 500 freely available IRs collected from OpenAIRLib 4 and MicIRP. 5 The IR signal is convolved with the audio on-the-fly during training, such that in each epoch the model encounters different audio scenarios, which should allow for a more robust audio encoding model.…”

Section: A Impulse Response Data Augmentationmentioning

confidence: 99%

Multi-modal Conditional Bounding Box Regression for Music Score Following

Henkel¹,

Widmer²

2021

Preprint

View full text Add to dashboard Cite

This paper addresses the problem of sheet-imagebased on-line audio-to-score alignment also known as score following. Drawing inspiration from object detection, a conditional neural network architecture is proposed that directly predicts x,y coordinates of the matching positions in a complete score sheet image at each point in time for a given musical performance.Experiments are conducted on a synthetic polyphonic piano benchmark dataset and the new method is compared to several existing approaches from the literature for sheet-image-based score following as well as an Optical Music Recognition baseline.The proposed approach achieves new state-of-the-art results and furthermore significantly improves the alignment performance on a set of real-world piano recordings by applying Impulse Responses as a data augmentation technique.

show abstract

“…This is the consequence of the ability of DNNs to learn complex nonlinear mappings through which musical objects can be expected to be better separated [11]. Hence, while DNNs generally learn "deep" features stemming from multiple examples in a training phase, and then evaluate the potential of learned features [12], our SSAE approach focuses on the different events within a single song and tries to learn nonlinear latent representations, used to infer the structure.…”

Section: Introductionmentioning

confidence: 99%

Barwise Compression Schemes for Audio-Based Music Structure Analysis

Marmoret¹,

Cohen²,

Bimbot³

2022

Preprint

View full text Add to dashboard Cite

Music Structure Analysis (MSA) consists in segmenting a music piece in several distinct sections. We approach MSA within a compression framework, under the hypothesis that the structure is more easily revealed by a simplified representation of the original content of the song.More specifically, under the hypothesis that MSA is correlated with similarities occurring at the bar scale, linear and non-linear compression schemes can be applied to barwise audio signals. Compressed representations capture the most salient components of the different bars in the song and are then used to infer the song structure using a dynamic programming algorithm.This work explores both low-rank approximation models such as Principal Component Analysis or Nonnegative Matrix Factorization and "piece-specific" Auto-Encoding Neural Networks, with the objective to learn latent representations specific to a given song. Such approaches do not rely on supervision nor annotations, which are well-known to be tedious to collect and possibly ambiguous in MSA description.In our experiments, several unsupervised compression schemes achieve a level of performance comparable to that of stateof-the-art supervised methods (for 3s tolerance) on the RWC-Pop dataset, showcasing the importance of the barwise compression processing for MSA.

show abstract

Learning Frame Similarity using Siamese networks for Audio-to-Score Alignment

Cited by 9 publications

References 17 publications

Structure-Aware Audio-to-Score Alignment Using Progressively Dilated Convolutional Neural Networks

Structure-Aware Audio-to-Score Alignment Using Progressively Dilated Convolutional Neural Networks

Multi-modal Conditional Bounding Box Regression for Music Score Following

Barwise Compression Schemes for Audio-Based Music Structure Analysis

Contact Info

Product

Resources

About