Temporal Pyramid Pooling Convolutional Neural Network for Cover Song Identification

Yu, Zhesong; Xu, Xiaoshuo; Chen, Xiaoou; Yang, Dingcheng

doi:10.24963/ijcai.2019/673

Cited by 28 publications

(46 citation statements)

References 2 publications

(6 reference statements)

“…
Method                      MAP     MR1
Results on Da-TACOS
2DFTM [17]                  0.275   155
SiMPle [18]                 0.332   142
Dmax [14]                   0.322   132
Qmax [10]                   0.365   113
Qmax* [30]                  0.373   104
EarlyFusion [12]            0.426   116
LateFusion [14]             0.454   177
MOVE w/ d = 4k (ours)       0.489   43
MOVE w/ d = 16k (ours)      0.506   42
Results on YTC
SiMPle [18]                 0.591   8
2DFTM sequences [29]        0.648   8
InNet [19]                  0.660   6
SuCo-DTW [31]               0.800   3
CQT-TPPNet [20]             0.859   3
MOVE w/ d = 16k (ours)      0.885   3
Table 2. Comparison of state-of-the-art VI systems (best results are highlighted in bold).…”

Section: Map Mr1 (mentioning)

confidence: 99%

Accurate and Scalable Version Identification Using Musically-Motivated Embeddings

Yesiler

Serrà

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The version identification (VI) task deals with the automatic detection of recordings that correspond to the same underlying musical piece. Despite many efforts, VI is still an open problem, with much room for improvement, specially with regard to combining accuracy and scalability. In this paper, we present MOVE, a musically-motivated method for accurate and scalable version identification. MOVE achieves state-of-the-art performance on two publicly-available benchmark sets by learning scalable embeddings in an Euclidean distance space, using a triplet loss and a hard triplet mining strategy. It improves over previous work by employing an alternative input representation, and introducing a novel technique for temporal content summarization, a standardized latent space, and a data augmentation strategy specifically designed for VI. In addition to the main results, we perform an ablation study to highlight the importance of our design choices, and study the relation between embedding dimensionality and model performance. Index Terms: Cover song identification, deep learning, music embedding, network encoder.
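The abstract above mentions learning embeddings in a Euclidean space with a triplet loss and hard triplet mining. As a rough illustration only, the following NumPy sketch computes a generic "batch-hard" triplet loss: for each anchor it takes the farthest same-label item and the closest different-label item. The function names, the margin value, and the batch-hard rule itself are assumptions made for this example; the abstract does not specify the mining strategy used in MOVE.

```python
import numpy as np


def pairwise_euclidean(x):
    """Return the (n, n) matrix of Euclidean distances between rows of x."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    # Clamp tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(d2, 0.0))


def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Generic batch-hard triplet loss (illustrative, not the MOVE recipe).

    embeddings: (n, d) array, one row per recording.
    labels:     length-n sequence; equal labels mean "same underlying piece".
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    dist = pairwise_euclidean(embeddings)

    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos_mask = same & not_self   # other versions of the same piece
    neg_mask = ~same             # recordings of different pieces

    # Hardest positive: farthest version of the same piece.
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    # Hardest negative: closest recording of a different piece.
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)

    # Only anchors with at least one positive and one negative contribute.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    if not valid.any():
        return 0.0
    losses = np.maximum(hardest_pos[valid] - hardest_neg[valid] + margin, 0.0)
    return float(losses.mean())
```

With well-separated groups the loss is zero once the margin is satisfied; a larger margin makes the same batch contribute a positive loss.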

“…Each song in Youtube has 7 versions, with 2 original versions and 5 different versions and thus results in 350 recordings in total. In our experiment, we use the 100 original versions as references and the others as queries following the same as [15,9,8].…”

Section: Dataset (mentioning)

confidence: 99%

“…
Method                 MAP     P@10    MR1     Time
Results on Youtube
DPLA [2]               0.525   0.132   9.43    2420s
SiMPle [15]            0.591   0.140   7.91    18.7s
Fingerprinting [16]    0.648   0.145   8.27    -
SuCo-DTW [17]          0.800   0.180   3.42    4.59s
Ki-CNN [8]             0.656   0.155   6.26    0.35ms
TPPNet [9]             0.859   0.188   2.85    0.04ms
CQT-Net                0.917   0.192   2.50    0.04ms
Results on Covers80
NCP-WIDI [18]          0.645   -       -       -
CRP [3]                0.544   0.061   -       -
Fusing [19]            0.625   0.071   -       -
Ki-CNN [8]             0.506   0.068   16.4    0.55ms
TPPNet [9]             0.744   0.086   6.88    0.06ms
CQT-Net                0.840   0.091   3.85    0.06ms
Results on Mazurkas
DTW [15]               0.882   0.949   4.05    -
NCD [20]               0.767   -       -       -
Compression [21]       0.795   -       -       -
Fingerprinting [22]    0.819   -       -       -
SiMPle [15]            0.880   0.952   2.33    -
SuCo-repeat [17]       0.850   0.940   2.77    -
2DFM [4] 0
Table 1. Performance on different datasets (- indicates the results are not shown in original works).…”

Section: Mr1 Time (mentioning)

confidence: 99%
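The comparisons quoted above report MAP and MR1. For readers unfamiliar with these retrieval metrics, here is a minimal sketch of how they are commonly computed from a query-by-reference distance matrix, taking MAP as mean average precision and MR1 as the mean rank of the first correctly identified version. The function name and input layout are assumptions for illustration; the cited papers may differ in details such as tie handling or excluding the query from the reference set.

```python
import numpy as np


def map_and_mr1(dist, query_labels, ref_labels):
    """Compute (MAP, MR1) from a distance matrix.

    dist:         (n_queries, n_refs) array, smaller means more similar.
    query_labels: length n_queries, identifier of each query's piece.
    ref_labels:   length n_refs, identifier of each reference's piece.
    Queries with no matching reference are skipped.
    """
    dist = np.asarray(dist, dtype=float)
    query_labels = np.asarray(query_labels)
    ref_labels = np.asarray(ref_labels)

    average_precisions = []
    first_ranks = []
    for q in range(dist.shape[0]):
        # Rank references from most to least similar.
        order = np.argsort(dist[q], kind="stable")
        relevant = ref_labels[order] == query_labels[q]
        if not relevant.any():
            continue
        hits = np.flatnonzero(relevant)  # 0-based ranks of correct results
        # Precision at each correct result: (k-th hit) / (its 1-based rank).
        precisions = np.arange(1, len(hits) + 1) / (hits + 1)
        average_precisions.append(precisions.mean())
        first_ranks.append(hits[0] + 1)

    return float(np.mean(average_precisions)), float(np.mean(first_ranks))
```

Higher MAP is better, while lower MR1 is better, which is why the two columns move in opposite directions in the quoted results.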

“…Moreover, the query time shown in the table does not include the time of feature extracting. Therefore, our method has the same time consumed as [9]. It extracts a fixed-dimensional feature whatever the duration of input audio is.…”

Section: Comparison (mentioning)

confidence: 99%
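The statement above notes that a fixed-dimensional feature is extracted regardless of audio duration. One way temporal pyramid pooling can achieve this is sketched below: the time axis of a feature map is split into 1, 2, and 4 bins, each bin is max-pooled, and the results are concatenated. The pyramid levels, the use of max pooling, and the function name are assumptions chosen for this example and are not taken from the cited paper.

```python
import math

import numpy as np


def temporal_pyramid_pool(features, levels=(1, 2, 4)):
    """Pool a (channels, time) feature map into a fixed-length vector.

    For each pyramid level with n bins, the time axis is divided into n
    (possibly overlapping by one frame) bins and each bin is max-pooled.
    The output length is channels * sum(levels), independent of time length.
    """
    features = np.asarray(features, dtype=float)
    channels, steps = features.shape
    if steps < 1:
        raise ValueError("feature map needs at least one time step")

    pooled = []
    for n_bins in levels:
        for i in range(n_bins):
            # Bin edges in the style of adaptive pooling: never empty.
            start = (i * steps) // n_bins
            end = math.ceil((i + 1) * steps / n_bins)
            pooled.append(features[:, start:end].max(axis=1))
    return np.concatenate(pooled)
```

For example, feature maps with 50 and 173 time steps and 3 channels both yield vectors of length 3 * (1 + 2 + 4) = 21, so recordings of different durations can be compared with a single distance computation.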

“…Moreover, deep learning approaches are introduced to cover song identification. For instance, CNNs are utilized to measure the similarity matrix [6] or learn features [7,8,9,10]. While these methods have achieved promising results, there is still room for improvement.…”

Section: Introduction (mentioning)

confidence: 99%

Learning a Representation for Cover Song Identification Using Convolutional Neural Network

Chen

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Cover song identification represents a challenging task in the field of Music Information Retrieval (MIR) due to complex musical variations between query tracks and cover versions. Previous works typically utilize hand-crafted features and alignment algorithms for the task. More recently, further breakthroughs are achieved employing neural network approaches. In this paper, we propose a novel Convolutional Neural Network (CNN) architecture based on the characteristics of the cover song task. We first train the network through classification strategies; the network is then used to extract music representation for cover song identification. A scheme is designed to train robust models against tempo changes. Experimental results show that our approach outperforms state-of-the-art methods on all public datasets, improving the performance especially on the large dataset.

Temporal Pyramid Pooling for Decoding Motor-Imagery EEG Signals

Jeong

2021

IEEE Access

Detecting a user's intentions is critical in human-computer interactions. Recently, brain-computer interfaces (BCIs) have been extensively studied to facilitate more accurate detection and prediction of the user's intentions. Specifically, various deep learning approaches have been applied to the BCIs for decoding the user's intent from motor-imagery electroencephalography (EEG) signals. However, their ability to capture the important features of an EEG signal remains limited, resulting in the deterioration of performance. In this paper, we propose a multi-layer temporal pyramid pooling approach to improve the performance of motor imagery-based BCIs. The proposed scheme introduces the application of multilayer multiscale pooling and fusion methods to capture various features of an EEG signal, which can be easily integrated into modern convolutional neural networks (CNNs). The experimental results based on the BCI competition IV dataset indicate that the CNN architectures with the proposed multilayer pyramid pooling method enhance classification performance compared to the original networks.
