Xingjian Du scite author profile

This paper presents ByteCover, a new feature learning method for cover song identification (CSI). ByteCover builds on the classical ResNet model and adds two improvements that strengthen the model for CSI. First, we integrate instance normalization (IN) and batch normalization (BN) into IBN blocks, the major components of our ResNet-IBN model. With the IBN blocks, our CSI model learns features that are invariant to changes in musical attributes such as key, tempo, timbre, and genre, while preserving the version information. Second, we employ the BN-Neck method to enable multi-loss training, jointly optimizing a classification loss and a triplet loss, so that inter-class discrimination and intra-class compactness of cover songs are ensured at the same time. Experiments on multiple datasets demonstrate the effectiveness and efficiency of ByteCover; on the Da-TACOS dataset, ByteCover outperformed the best competing system by 18.0%.
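The multi-loss idea in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the usual BN-Neck arrangement, in which the triplet loss is computed on the raw embeddings and the classification loss on batch-normalized embeddings. The batch-hard mining strategy, the margin of 0.3, the equal weighting of the two losses, and all function names are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def batch_hard_triplet(embeddings, labels, margin=0.3):
    """Triplet loss using, per anchor, the farthest positive and nearest negative.

    Batch-hard mining and the margin value are illustrative assumptions.
    """
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))           # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]          # True where versions match
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return float(np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean())

def bnneck_joint_loss(features, labels, classifier_weights, margin=0.3, eps=1e-5):
    """Sum of a triplet loss (raw features) and a classification loss (normalized features)."""
    triplet = batch_hard_triplet(features, labels, margin)
    # Batch normalization without learned scale/shift, applied before the classifier.
    normed = (features - features.mean(axis=0)) / np.sqrt(features.var(axis=0) + eps)
    logits = normed @ classifier_weights
    return cross_entropy(logits, labels) + triplet
```

Here `features` is a batch of embeddings of shape (batch, dim), `labels` holds integer version identifiers, and `classifier_weights` is a hypothetical (dim, classes) matrix standing in for the classification layer.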

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Chen, Du, Zhu et al. 2022

Preprint

Speech Enhancement with Weakly Labelled Data from AudioSet

Kong, Liu, Du et al. 2021

Speech enhancement with weakly labelled data from AudioSet

Kong, Liu et al. 2021

Preprint

Speech enhancement is the task of improving the intelligibility and perceptual quality of degraded speech signals. Recently, neural network-based methods have been applied to speech enhancement. However, many of these methods require pairs of noisy and clean speech for training. We propose a speech enhancement framework that can be trained with the large-scale, weakly labelled AudioSet dataset. Weakly labelled data contain only the audio tags of audio clips, not the onset or offset times of speech. We first apply pretrained audio neural networks (PANNs) to detect anchor segments that contain speech or sound events in audio clips. Then, we randomly mix two detected anchor segments, one containing speech and one containing sound events, into a mixture, and build a conditional source separation network that uses PANNs predictions as soft conditions for speech enhancement. At inference, we input a noisy speech signal together with the one-hot encoding of "Speech" as a condition to the trained system to predict enhanced speech. Our system achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system, which scored 2.16 and 7.73 dB respectively.
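The data preparation steps described in the abstract above can be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' pipeline. It assumes a tagger has already produced frame-wise probabilities for a target class, picks the fixed-length window with the highest mean probability as the anchor, sums two anchors into a mixture, and builds a one-hot condition vector. All function names, the sliding-window selection rule, and the toy class count are hypothetical.

```python
import numpy as np

def select_anchor(frame_probs, segment_frames):
    """Return the start frame of the window with the highest mean probability.

    frame_probs: 1-D array of per-frame probabilities for one class
    (assumed to come from a pretrained tagger; not computed here).
    """
    window = np.ones(segment_frames) / segment_frames
    means = np.convolve(frame_probs, window, mode="valid")  # sliding mean
    return int(np.argmax(means))

def mix_anchors(speech_segment, event_segment):
    """Sum two equal-length anchor segments into one training mixture."""
    if speech_segment.shape != event_segment.shape:
        raise ValueError("anchor segments must have the same shape")
    return speech_segment + event_segment

def one_hot_condition(class_index, num_classes):
    """Condition vector for inference: 1.0 at the target class, 0.0 elsewhere."""
    condition = np.zeros(num_classes)
    condition[class_index] = 1.0
    return condition
```

During training, the abstract states that soft PANNs predictions serve as the condition; the one-hot vector above corresponds only to the inference-time condition for the "Speech" class, whose index depends on the label set in use.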

Bytecover: Cover Song Identification Via Multi-Loss Training