ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414128
ByteCover: Cover Song Identification via Multi-Loss Training

Abstract: We present in this paper ByteCover, which is a new feature learning method for cover song identification (CSI). ByteCover is built on the classical ResNet model, and two major improvements are designed to further enhance the capability of the model for CSI. In the first improvement, we introduce the integration of instance normalization (IN) and batch normalization (BN) to build IBN blocks, which are the major components of our ResNet-IBN model. With the help of the IBN blocks, our CSI model can learn featu…
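The abstract's first improvement is the IBN block. As a rough illustration, here is a minimal PyTorch sketch of the IBN idea (following Pan et al.'s IBN-Net): one half of the channels is instance-normalized, the other half batch-normalized, and the results are concatenated. The half-and-half split ratio and the affine IN are common defaults assumed here, not details confirmed by the ByteCover paper.

```python
import torch
import torch.nn as nn

class IBN(nn.Module):
    """Instance-Batch Normalization block: IN on part of the channels
    (invariance to style/timbre-like variation), BN on the rest
    (preserving discriminative statistics)."""

    def __init__(self, channels: int, ratio: float = 0.5):
        super().__init__()
        self.half = int(channels * ratio)  # channels routed to instance norm (assumed split)
        self.instance_norm = nn.InstanceNorm2d(self.half, affine=True)
        self.batch_norm = nn.BatchNorm2d(channels - self.half)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split along the channel axis, normalize each part, and re-join.
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.instance_norm(a), self.batch_norm(b)], dim=1)
```

In IBN-Net-style models, a block like this typically replaces the first BN layer inside early residual blocks of the ResNet backbone.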

Cited by 17 publications (14 citation statements) | References 13 publications
“…In the self-attention layer, we use a random mask to erase some values of the embedding to further improve robustness. As a result, the introduced time-pooling module leverages feature statistics and pays more attention to the discriminative frames, which leads to a clear improvement over the widely adopted generalized mean pooling (GeM) of recent approaches [6], [7], [10]. More ablation studies are shown in Section III-D.…”
Section: B. Time-Domain Pooling Module
confidence: 99%
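For context, the GeM baseline this statement compares against can be sketched in a few lines of PyTorch. The learnable exponent p (initialized to 3) and the clamp epsilon are common defaults assumed here, not values taken from [6], [7], or [10].

```python
import torch
import torch.nn as nn

class GeMPooling1d(nn.Module):
    """Generalized mean (GeM) pooling over the time axis: each channel is
    raised to a learnable power p, averaged over time, then the p-th root
    is taken. p -> 1 recovers average pooling; large p approaches max."""

    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # learnable pooling exponent
        self.eps = eps                          # avoids pow() on zeros

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, channels)
        x = x.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=-1).pow(1.0 / self.p)
```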
“…We use the embedding before the bottleneck layer to compute the contrastive loss (i.e., triplet loss [14]) and the embedding after the bottleneck layer to compute the focal loss [15] and center loss [16]. Note that we use the focal loss in place of the traditional cross-entropy loss of [6], which improves performance in the face of data imbalance. The center loss [16] helps training convergence and achieves higher performance.…”
Section: Loss
confidence: 99%
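The quoted setup combines three objectives across the two embeddings. Below is a hedged PyTorch sketch of one way such a multi-loss combination can be wired up; the triplet margin, focal gamma, and loss weights are illustrative assumptions, not values from the cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, labels: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """Focal loss [15]: cross-entropy scaled by (1 - p_t)^gamma, so easy,
    well-classified examples contribute less under class imbalance."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)
    return ((1.0 - log_pt.exp()) ** gamma * -log_pt).mean()

class CenterLoss(nn.Module):
    """Center loss [16]: squared distance of each embedding to a learnable
    per-class center, pulling same-class embeddings together."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return ((emb - self.centers[labels]) ** 2).sum(dim=1).mean()

triplet = nn.TripletMarginLoss(margin=0.3)  # margin is an assumption

def multi_loss(anchor, positive, negative,  # pre-bottleneck embedding triplets
               post_emb, logits, labels,    # post-bottleneck embedding & logits
               center_loss: CenterLoss,
               w_tri: float = 1.0, w_foc: float = 1.0, w_cen: float = 5e-4):
    """Combine the three objectives as in the quoted description:
    triplet on the pre-bottleneck embedding, focal + center on the
    post-bottleneck side. Weights are illustrative only."""
    return (w_tri * triplet(anchor, positive, negative)
            + w_foc * focal_loss(logits, labels)
            + w_cen * center_loss(post_emb, labels))
```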