2019
DOI: 10.48550/arxiv.1906.04972
Preprint

Toward Interpretable Music Tagging with Self-Attention

Cited by 18 publications (24 citation statements)
References 24 publications
“…Some approaches (Becker et al., 2018; Won et al., 2019) have shown the usability of attention/visualization techniques for interpreting audio processing networks. However, we focus here more on methods that attempt to address audio interpretability beyond image-based visualizations.…”
Section: Interpretability Methods For Audio
confidence: 99%
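
As an illustration of the attention-based interpretability this excerpt refers to, here is a minimal PyTorch sketch of pulling self-attention weights over spectrogram frames and plotting them as a per-frame saliency curve. All sizes and the single attention layer are assumptions made for the example, not the cited models' actual code.

# Minimal sketch: visualizing self-attention weights over spectrogram frames
# to see which time steps a tagging model attends to. embed_dim, n_frames,
# and the standalone attention layer are illustrative assumptions.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

embed_dim, n_heads, n_frames = 64, 4, 96      # hypothetical sizes

attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
frames = torch.randn(1, n_frames, embed_dim)  # stand-in for CNN frame features

# need_weights=True returns head-averaged attention weights (batch, tgt, src)
_, weights = attn(frames, frames, frames, need_weights=True)

# Averaging over query positions gives one saliency value per input frame,
# which can be read against the spectrogram's time axis.
saliency = weights[0].mean(dim=0).detach()
plt.plot(saliency.numpy())
plt.xlabel("spectrogram frame")
plt.ylabel("mean attention weight")
plt.show()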
“…Baseline Models Two baseline methods are compared. The first is CNNSA [6], which employs a convolutional front-end and a transformer encoder to aggregate temporal features. The second baseline [13] uses a 7-layer short-chunk CNN with residual connections, followed by a fully-connected layer for the final output.…”
Section: Datasets
confidence: 99%
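
For orientation, a rough PyTorch sketch of the CNNSA-style pipeline this excerpt describes follows: a convolutional front-end that downsamples the mel-spectrogram, a transformer encoder that aggregates the temporal features, and a fully-connected output layer. Layer counts and sizes here are assumptions for the sketch, not the papers' exact configurations.

# CNNSA-style sketch: conv front-end -> transformer encoder -> FC head.
import torch
import torch.nn as nn

class CNNSASketch(nn.Module):
    def __init__(self, n_mels=128, d_model=128, n_tags=50):
        super().__init__()
        # Conv front-end: treat the mel-spectrogram as a 1-channel image.
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, d_model, 3, padding=1), nn.BatchNorm2d(d_model), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the frequency axis
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.fc = nn.Linear(d_model, n_tags)

    def forward(self, spec):                  # spec: (batch, n_mels, time)
        x = self.frontend(spec.unsqueeze(1))  # (batch, d_model, 1, time')
        x = x.squeeze(2).transpose(1, 2)      # (batch, time', d_model)
        x = self.encoder(x).mean(dim=1)       # aggregate over time
        return torch.sigmoid(self.fc(x))      # multi-label tag probabilities

tags = CNNSASketch()(torch.randn(2, 128, 256))  # -> (2, 50)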
“…On the other hand, in [4], two-dimensional convolutional layers were used, treating the frequency and time axes equally. There are also hybrid approaches, such as convolutional recurrent neural networks (CRNN) [5] and the convolutional Transformer [6], in which recurrent layers or a Transformer are applied along the time axis.…”
Section: Introduction
confidence: 99%
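
The fully two-dimensional approach mentioned here can be sketched as below: square conv filters and global pooling treat the frequency and time axes identically. Depths, channel widths, and the 50-tag output are assumptions for the example.

# Fully 2D convolutional tagger: no axis is singled out as "time".
import torch
import torch.nn as nn

conv2d_tagger = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),          # global pooling over freq and time alike
    nn.Flatten(),
    nn.Linear(128, 50),               # 50 hypothetical tags
)

logits = conv2d_tagger(torch.randn(2, 1, 128, 256))  # (batch, 1, mel, time)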
“…This is fed into four CNN layers to extract local timbre features, followed by a GRU module to extract time-domain features. After the CRNN layers, an attention layer is employed to strengthen important feature representations [16].…”
Section: The Architecture Of KNN-Net
confidence: 99%
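
A compact PyTorch sketch of the CRNN-plus-attention pipeline this excerpt describes: four conv layers for local timbre features, a GRU over time, and a simple additive attention layer that re-weights the GRU outputs. All dimensions are illustrative assumptions, not KNN-Net's published configuration.

# CRNN + attention sketch: CNN (timbre) -> GRU (time) -> attention pooling.
import torch
import torch.nn as nn

class CRNNAttnSketch(nn.Module):
    def __init__(self, n_mels=128, hidden=64, n_out=10):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU(),
                                 nn.MaxPool2d((2, 1)))  # pool frequency only
        self.cnn = nn.Sequential(block(1, 16), block(16, 32),
                                 block(32, 64), block(64, 64))
        self.gru = nn.GRU(64 * (n_mels // 16), hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)   # scores one weight per time step
        self.out = nn.Linear(hidden, n_out)

    def forward(self, spec):                     # (batch, n_mels, time)
        x = self.cnn(spec.unsqueeze(1))          # (batch, 64, n_mels/16, time)
        x = x.flatten(1, 2).transpose(1, 2)      # (batch, time, feat)
        h, _ = self.gru(x)                       # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention over time steps
        context = (w * h).sum(dim=1)             # weighted feature summary
        return self.out(context)

out = CRNNAttnSketch()(torch.randn(2, 128, 256))  # -> (2, 10)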