Music onset detection is essential for extracting high-level music features such as rhythm, beat, phrasing, and structure. Traditional onset-detection methods, which use Short-Time Fourier Transform (STFT)- or Wavelet Transform (WT)-based features to characterize the music signal, generally lack the adaptiveness to represent both the stationary and non-stationary parts of the signal, which degrades note onset detection performance. To address this problem, a new note onset detection algorithm based on sparse decomposition is proposed. First, the music signal is sparsely decomposed with Matching Pursuit (MP); a hybrid detection algorithm combining the Degree of Explanation (DE) and the Change of Partials (CP) is then applied to the resulting sparse representation. Finally, a modified peak-picking algorithm generates the onset vectors. Experiments on a dataset with 2,050 onsets show that our results are superior to those of MIREX 2013. For polyphonic music, the most common form in practice, the proposed algorithm outperforms the other algorithms.
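The Matching Pursuit step mentioned above can be sketched as a greedy loop: at each iteration, select the dictionary atom most correlated with the current residual, record its coefficient, and subtract its contribution. A minimal sketch in NumPy, assuming a dictionary matrix with unit-norm columns (the function name and signature here are illustrative, not the authors' implementation):

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse decomposition of `signal` over the columns of
    `dictionary` (assumed unit-norm). At each step, pick the atom with
    the largest correlation to the residual, store its coefficient,
    and subtract its projection from the residual."""
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        correlations = dictionary.T @ residual   # inner products with all atoms
        k = int(np.argmax(np.abs(correlations))) # best-matching atom
        coeffs[k] += correlations[k]
        residual -= correlations[k] * dictionary[:, k]
    return coeffs, residual
```

With a dictionary mixing short and long atoms (e.g., Gabor atoms of varying scale), this greedy selection is what gives the representation its adaptiveness to both transient and steady-state portions of the signal.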
Singing voice detection remains a challenging task because the voice can be obscured by instruments occupying the same frequency band, or even sharing the same timbre when they mimic the mechanism of human singing. Because of the poor adaptability and complexity of feature engineering, there is a recent trend toward feature learning, in which deep neural networks perform both feature extraction and classification. In this paper, we present two methods that exploit channel properties in a convolutional neural network to improve singing voice detection through feature learning. First, channel attention learning is introduced to measure feature importance, using two attention mechanisms: the scaled dot-product and squeeze-and-excitation. This method learns the importance of each feature map so that the network can attend more strongly to the most informative ones. Second, multi-scale representations are fed to the input channels to add information across scales. Different songs generally require different spectrogram scales to be represented well, and multi-scale representations let the network choose the scale best suited to the task. In our experiments on three public datasets, the two methods proved effective, increasing accuracy by up to 2.13 percent over an already high baseline.
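The squeeze-and-excitation mechanism named above can be illustrated in a few lines: each channel is global-average-pooled to a scalar ("squeeze"), a small two-layer bottleneck maps those scalars to per-channel weights in (0, 1) ("excitation"), and each feature map is rescaled by its weight. A minimal NumPy sketch, assuming a (C, H, W) feature tensor and hypothetical bottleneck weights `w1`, `w2` (not the paper's actual architecture):

```python
import numpy as np

def squeeze_excitation(feature_maps, w1, w2):
    """Channel attention over a (C, H, W) tensor: squeeze each channel
    with global average pooling, excite through a ReLU-sigmoid
    bottleneck, then rescale each feature map by its learned weight."""
    c = feature_maps.shape[0]
    squeezed = feature_maps.reshape(c, -1).mean(axis=1)   # (C,) channel descriptors
    hidden = np.maximum(0.0, w1 @ squeezed)               # ReLU bottleneck, (r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))        # sigmoid gates, (C,)
    return feature_maps * weights[:, None, None]
```

Because the gates lie in (0, 1), less informative channels are attenuated rather than removed, which is how the network "places more attention" on the most important feature maps.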