Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413721

PopMAG: Pop Music Accompaniment Generation

Abstract: In pop music, accompaniments are usually played by multiple instruments (tracks) such as drums, bass, strings and guitar, and can make a song more expressive and infectious when arranged together with its melody. Previous works usually generate the tracks separately, and the music notes from different tracks do not explicitly depend on each other, which hurts harmony modeling. To improve harmony, in this paper we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-t…
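The abstract's central idea is to encode all tracks in one token sequence so that notes from different tracks can directly condition on each other. The sketch below is a minimal, illustrative guess at what such a single-sequence multi-track encoding could look like; the token names (Bar, Pos, Track, Pitch, Dur), the ordering, and the toy note list are assumptions made for illustration, not the paper's actual MuMIDI vocabulary.

```python
# Illustrative sketch (not the paper's implementation): interleave notes from
# several tracks into one ordered token sequence.
notes = [
    # (bar, position_in_bar, track, pitch, duration) -- toy data
    (0, 0, "bass",   36, 4),
    (0, 0, "guitar", 60, 2),
    (0, 2, "guitar", 64, 2),
    (1, 0, "drums",  38, 1),
]

def encode_multitrack(notes):
    """Flatten multi-track notes into a single token sequence, emitting a Bar
    token on every bar change and a Track token on every track change, so that
    later tokens can attend to earlier tokens from any track."""
    tokens = []
    current_bar, current_track = None, None
    for bar, pos, track, pitch, dur in sorted(notes):
        if bar != current_bar:
            tokens.append(f"Bar_{bar}")
            current_bar, current_track = bar, None
        tokens.append(f"Pos_{pos}")
        if track != current_track:
            tokens.append(f"Track_{track}")
            current_track = track
        tokens.extend([f"Pitch_{pitch}", f"Dur_{dur}"])
    return tokens

print(encode_multitrack(notes))
# ['Bar_0', 'Pos_0', 'Track_bass', 'Pitch_36', 'Dur_4',
#  'Pos_0', 'Track_guitar', 'Pitch_60', 'Dur_2',
#  'Pos_2', 'Pitch_64', 'Dur_2',
#  'Bar_1', 'Pos_0', 'Track_drums', 'Pitch_38', 'Dur_1']
```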

Cited by 52 publications (5 citation statements) · References 12 publications
“…An effective musical representation is essential for learning different music-related tasks, such as music classification [7,27,36,48,55,56], cover song identification [56,58,62,63], and music generation [17,18,28,40]. Most of them rely on large amounts of labeled data to learn music representations.…”
Section: Music Representation Learning (mentioning)
confidence: 99%
“…• embedding pooling, such as Compound Word (Hsiao et al., 2021), Octuple (Zeng et al., 2021), PopMAG (Ren et al., 2020), SymphonyNet (Liu et al., 2022) or MMT (Dong et al., 2023). Embeddings of several tokens are merged with a pooling operation.…”
Section: Sequence Length Reduction Strategies (mentioning)
confidence: 99%
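As a rough illustration of the "embedding pooling" strategy described in the excerpt above, the following sketch embeds the sub-tokens of a compound musical event separately and merges them into a single vector, so the model sees one sequence position per event rather than per sub-token. The class name, vocabulary sizes, and the choice of summation as the pooling operation are assumptions for illustration, not taken from any of the cited papers.

```python
import torch
import torch.nn as nn

class PooledTokenEmbedding(nn.Module):
    """Embed each sub-token (e.g. pitch, duration, velocity) of a compound
    event with its own table, then pool the embeddings into one vector."""

    def __init__(self, vocab_sizes, d_model=512):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(v, d_model) for v in vocab_sizes
        )

    def forward(self, compound_tokens):
        # compound_tokens: (batch, seq_len, num_subtokens) integer ids
        pooled = sum(
            emb(compound_tokens[..., i]) for i, emb in enumerate(self.embeddings)
        )
        return pooled  # (batch, seq_len, d_model): one vector per event

# Toy usage: three sub-token vocabularies (e.g. pitch, duration, velocity).
layer = PooledTokenEmbedding(vocab_sizes=[128, 64, 32], d_model=512)
events = torch.randint(0, 32, (2, 16, 3))  # ids kept below the smallest vocab
print(layer(events).shape)  # torch.Size([2, 16, 512])
```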
“…the quality of generated music or the accuracy of classification tasks, and 2) the efficiency of the models. The former is tackled with more expressive representations (Huang and Yang, 2020; Kermarec et al., 2022; von Rütte et al., 2023; Fradet et al., 2021), and the latter by representations based on either token combinations (Payne, 2019; Donahue et al., 2019) or embedding pooling (Hsiao et al., 2021; Zeng et al., 2021; Ren et al., 2020; Dong et al., 2023), which reduce the overall sequence length.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, with recent advancements in the field of speech synthesis, deep learning-based approaches have gained significant traction. This means that instead of relying on the conventional approach comprising multiple subprocesses, there has been a notable shift toward the development of end-to-end TTS technology (Ren et al., 2020), which is supported by trained models.…”
Section: Introduction (mentioning)
confidence: 99%