2019
DOI: 10.48550/arxiv.1906.04972
Preprint

Toward Interpretable Music Tagging with Self-Attention

Cited by 18 publications (24 citation statements)
References 24 publications
“…Some approaches (Becker et al., 2018; Won et al., 2019) have shown the usability of attention/visualization techniques for interpreting audio processing networks. However, we focus here more on methods that attempt to address audio interpretability beyond image-based visualizations.…”
Section: Interpretability Methods For Audio
confidence: 99%
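
As an illustration of the attention-based interpretability this excerpt refers to, here is a minimal PyTorch sketch of pulling self-attention weights over spectrogram frames and plotting them as a per-frame saliency curve. All sizes and the single attention layer are assumptions made for the example, not the cited models' actual code.

# Minimal sketch: visualizing self-attention weights over spectrogram frames
# to see which time steps a tagging model attends to. embed_dim, n_frames,
# and the standalone attention layer are illustrative assumptions.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

embed_dim, n_heads, n_frames = 64, 4, 96      # hypothetical sizes

attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
frames = torch.randn(1, n_frames, embed_dim)  # stand-in for CNN frame features

# need_weights=True returns head-averaged attention weights (batch, tgt, src)
_, weights = attn(frames, frames, frames, need_weights=True)

# Averaging over query positions gives one saliency value per input frame,
# which can be read against the spectrogram's time axis.
saliency = weights[0].mean(dim=0).detach()
plt.plot(saliency.numpy())
plt.xlabel("spectrogram frame")
plt.ylabel("mean attention weight")
plt.show()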
“…Baseline Models Two baseline methods are compared. The first is CNNSA [6], which employs a convolutional front-end and a transformer encoder to aggregate temporal features. The second baseline [13] uses a 7-layer short-chunk CNN with residual connections, followed by a fully-connected layer for the final output.…”
Section: Datasets
confidence: 99%
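
For orientation, a rough PyTorch sketch of the CNNSA-style pipeline this excerpt describes follows: a convolutional front-end that downsamples the mel-spectrogram, a transformer encoder that aggregates the temporal features, and a fully-connected output layer. Layer counts and sizes here are assumptions for the sketch, not the papers' exact configurations.

# CNNSA-style sketch: conv front-end -> transformer encoder -> FC head.
import torch
import torch.nn as nn

class CNNSASketch(nn.Module):
    def __init__(self, n_mels=128, d_model=128, n_tags=50):
        super().__init__()
        # Conv front-end: treat the mel-spectrogram as a 1-channel image.
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, d_model, 3, padding=1), nn.BatchNorm2d(d_model), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the frequency axis
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.fc = nn.Linear(d_model, n_tags)

    def forward(self, spec):                  # spec: (batch, n_mels, time)
        x = self.frontend(spec.unsqueeze(1))  # (batch, d_model, 1, time')
        x = x.squeeze(2).transpose(1, 2)      # (batch, time', d_model)
        x = self.encoder(x).mean(dim=1)       # aggregate over time
        return torch.sigmoid(self.fc(x))      # multi-label tag probabilities

tags = CNNSASketch()(torch.randn(2, 128, 256))  # -> (2, 50)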
“…On the other hand, in [4], two-dimensional convolutional layers were used, treating the frequency and time axes equally. There are also hybrid approaches, such as convolutional recurrent neural networks (CRNN) [5] and the convolutional Transformer [6], in which recurrent layers or a Transformer are applied along the time axis.…”
Section: Introduction
confidence: 99%
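
The fully two-dimensional approach mentioned here can be sketched as below: square conv filters and global pooling treat the frequency and time axes identically. Depths, channel widths, and the 50-tag output are assumptions for the example.

# Fully 2D convolutional tagger: no axis is singled out as "time".
import torch
import torch.nn as nn

conv2d_tagger = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),          # global pooling over freq and time alike
    nn.Flatten(),
    nn.Linear(128, 50),               # 50 hypothetical tags
)

logits = conv2d_tagger(torch.randn(2, 1, 128, 256))  # (batch, 1, mel, time)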
“…This is fed into four CNN layers to extract local timbre features, followed by a GRU module to extract time-domain features. After the CRNN layers, an attention layer is employed to strengthen important feature representations [16].…”
Section: The Architecture Of KNN-Net
confidence: 99%
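
A compact PyTorch sketch of the CRNN-plus-attention pipeline this excerpt describes: four conv layers for local timbre features, a GRU over time, and a simple additive attention layer that re-weights the GRU outputs. All dimensions are illustrative assumptions, not KNN-Net's published configuration.

# CRNN + attention sketch: CNN (timbre) -> GRU (time) -> attention pooling.
import torch
import torch.nn as nn

class CRNNAttnSketch(nn.Module):
    def __init__(self, n_mels=128, hidden=64, n_out=10):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU(),
                                 nn.MaxPool2d((2, 1)))  # pool frequency only
        self.cnn = nn.Sequential(block(1, 16), block(16, 32),
                                 block(32, 64), block(64, 64))
        self.gru = nn.GRU(64 * (n_mels // 16), hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)   # scores one weight per time step
        self.out = nn.Linear(hidden, n_out)

    def forward(self, spec):                     # (batch, n_mels, time)
        x = self.cnn(spec.unsqueeze(1))          # (batch, 64, n_mels/16, time)
        x = x.flatten(1, 2).transpose(1, 2)      # (batch, time, feat)
        h, _ = self.gru(x)                       # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention over time steps
        context = (w * h).sum(dim=1)             # weighted feature summary
        return self.out(context)

out = CRNNAttnSketch()(torch.randn(2, 128, 256))  # -> (2, 10)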