2018 IEEE International Conference on Multimedia and Expo (ICME) 2018
DOI: 10.1109/icme.2018.8486531

Key-Invariant Convolutional Neural Network Toward Efficient Cover Song Identification

Citations: cited by 26 publications (37 citation statements)
References: 6 publications
“…With the current best setup, the total number of parameters is 6.3 M. We now motivate and present the key components of MOVE. Transposition-invariant architecture: Following the strategy proposed by Xu et al. [19], we increase the dimension of the crema-PCP inputs X from 12×T to 23×T by concatenating two copies of X in the pitch dimension and removing the last pitch class. The first convolutional layer, with a kernel size of 12×180, traverses the input, going through all possible transpositions in the pitch dimension, and the subsequent max-pooling layer, with a kernel size of 12×1, keeps the transposition with the highest activation value (convolutions in MOVE have no padding).…”
Section: Network Architecture (mentioning)
confidence: 99%
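The transposition trick quoted above can be illustrated with a small NumPy sketch. It is not the MOVE or Xu et al. implementation: the function names are hypothetical, a single random kernel stands in for a learned convolutional layer, and the kernel width is shortened from 180 to 5 frames to keep the demo small. It only shows why stacking two copies of a 12×T input, dropping the last row, and sliding a 12-row kernel without padding visits all 12 circular pitch shifts, so that a max over the pitch axis is unchanged by transposition.

```python
import numpy as np

def extend_pcp(x):
    """Stack two copies of a (12, T) pitch-class matrix and drop the last row -> (23, T)."""
    assert x.shape[0] == 12
    return np.concatenate([x, x], axis=0)[:-1]

def transposition_responses(x, kernel):
    """Unpadded cross-correlation of a (12, K) kernel over the (23, T) input.

    Returns an array of shape (12, T - K + 1): one row per circular pitch shift.
    """
    ext = extend_pcp(x)
    windows = np.lib.stride_tricks.sliding_window_view(ext, kernel.shape)
    # windows has shape (12, T - K + 1, 12, K); contract the last two axes with the kernel.
    return np.einsum("ptij,ij->pt", windows, kernel)

def key_invariant_feature(x, kernel):
    """Max over the 12 pitch shifts (the role of the 12x1 max-pooling step)."""
    return transposition_responses(x, kernel).max(axis=0)

rng = np.random.default_rng(0)
x = rng.random((12, 40))       # toy pitch-class features, T = 40
kernel = rng.random((12, 5))   # stand-in for one learned 12xK filter

original = key_invariant_feature(x, kernel)
transposed = key_invariant_feature(np.roll(x, 3, axis=0), kernel)  # shift by 3 semitones
```

Here `original` and `transposed` agree up to floating-point error, because rolling the input only permutes the rows of the response map before the max is taken.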
“…
Method                        MAP     MR1

Results on Da-TACOS
2DFTM [17]                    0.275   155
SiMPle [18]                   0.332   142
Dmax [14]                     0.322   132
Qmax [10]                     0.365   113
Qmax* [30]                    0.373   104
EarlyFusion [12]              0.426   116
LateFusion [14]               0.454   177
MOVE w/ d = 4 k (ours)        0.489   43
MOVE w/ d = 16 k (ours)       0.506   42

Results on YTC
SiMPle [18]                   0.591   8
2DFTM sequences [29]          0.648   8
InNet [19]                    0.660   6
SuCo-DTW [31]                 0.800   3
CQT-TPPNet [20]               0.859   3
MOVE w/ d = 16 k (ours)       0.885   3

Table 2. Comparison of state-of-the-art VI systems (best results are highlighted in bold).…”
Section: MAP MR1 (mentioning)
confidence: 99%
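The two columns in the table above are labeled MAP and MR1. Assuming these carry their usual retrieval meanings (mean average precision, where higher is better, and mean rank of the first correct result, where lower is better), a minimal sketch of how such scores are computed from ranked result lists could look as follows. The function names and the toy rankings are illustrative and not taken from any of the cited papers.

```python
def average_precision(relevance):
    """Average precision for one query; relevance is a ranked list of 0/1 flags."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0

def rank_of_first_hit(relevance):
    """1-based rank of the first relevant item, or None if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return rank
    return None

def evaluate(rankings):
    """Return (MAP, MR1) over a list of ranked relevance lists."""
    aps = [average_precision(r) for r in rankings]
    firsts = [rank_of_first_hit(r) for r in rankings if any(r)]
    return sum(aps) / len(aps), sum(firsts) / len(firsts)

# Two toy queries: 1 marks a true version of the query, 0 marks anything else.
toy_rankings = [[1, 0, 1], [0, 1, 0]]
mean_ap, mr1 = evaluate(toy_rankings)
```

Queries with no relevant item are skipped in the MR1 average here; that is a choice made for this sketch, and evaluation protocols may handle that case differently.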
“…Second Hand Songs 100K (SHS100K), which is collected from the Second Hand Songs website by [8], consists of 8858 songs with various covers and 108523 recordings. This dataset is split into three subsets (SHS100K-TRAIN, SHS100K-VAL and SHS100K-TEST) with a ratio of 8 : 1 : 1.…”
Section: Dataset (mentioning)
confidence: 99%
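The quoted passage gives only the 8 : 1 : 1 ratio, not how the split is drawn. One plausible reading, shown here purely as an assumption, is to split at the song level so that all recordings of a song land in the same subset. The helper name `split_by_song`, the fixed seed, and the use of integer IDs in place of real song identifiers are all illustrative.

```python
import random

def split_by_song(song_ids, ratios=(8, 1, 1), seed=0):
    """Shuffle song IDs and cut them into train/val/test by the given ratio."""
    ids = sorted(song_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    total = sum(ratios)
    n = len(ids)
    cut1 = n * ratios[0] // total
    cut2 = n * (ratios[0] + ratios[1]) // total
    return ids[:cut1], ids[cut1:cut2], ids[cut2:]

# 8858 is the song count stated in the quoted passage.
train, val, test = split_by_song(range(8858))
```

Splitting by song rather than by recording keeps covers of the same work out of both training and test data, which matters when the task is to recognize unseen songs.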
“…Each song in Youtube has 7 versions, with 2 original versions and 5 different versions and thus results in 350 recordings in total. In our experiment, we use the 100 original versions as references and the others as queries following the same as [15,9,8].…”
Section: Dataset (mentioning)
confidence: 99%