Boqing Zhu scite author profile

Audio scene classification, the problem of predicting class labels of audio scenes, has drawn much attention in recent years. However, it remains challenging, and existing methods fall short in both accuracy and efficiency. Recently, Convolutional Neural Network (CNN)-based methods have achieved better performance than traditional methods. Nevertheless, a conventional single-channel CNN may fail to exploit the additional cues embedded in multi-channel recordings. In this paper, we explore the use of a multi-channel CNN for the classification task, which aims to extract features from different channels in an end-to-end manner. We evaluate it against the conventional CNN and traditional Gaussian Mixture Model-based methods. Moreover, to improve the classification accuracy further, this paper explores the use of the mixup method. In brief, mixup trains the neural network on linear combinations of pairs of audio scene example representations and their labels. By employing mixup for data augmentation, the new model provides higher prediction accuracy and robustness than previous models, while the generalization error on the evaluation data is also reduced.
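As a rough illustration of the mixup step described in the abstract above, the following Python sketch blends one pair of feature vectors and their one-hot labels with a single shared weight. The function name `mixup_pair`, the default `alpha=0.2`, and the use of plain lists are assumptions made for this example and are not taken from the paper. Drawing the weight from a Beta(alpha, alpha) distribution follows the usual mixup formulation.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with one shared weight.

    alpha is a hypothetical default; the abstract does not state a value.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    # The same weight is applied to the inputs and to the labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


if __name__ == "__main__":
    rng = random.Random(0)
    features, label, lam = mixup_pair(
        [1.0, 0.0], [1.0, 0.0],  # example from class 0
        [0.0, 1.0], [0.0, 1.0],  # example from class 1
        rng=rng,
    )
    print(features, label, lam)
```

Because the mixed label is a convex combination of two one-hot vectors, it still sums to one and can be used directly with a cross-entropy loss that accepts soft targets.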

Learning Environmental Sounds with Multi-scale Convolutional Neural Network

Wang

Liu

et al. 2018

Deep learning has dramatically improved the performance of sound recognition. However, learning acoustic models directly from the raw waveform is still challenging. Current waveform-based models generally use time-domain convolutional layers to extract features. The features extracted by filters of a single size are insufficient for building a discriminative representation of audio. In this paper, we propose a multi-scale convolution operation, which yields a better audio representation by improving the frequency resolution and learning filters across all frequency ranges. To leverage waveform-based and spectrogram-based features in a single model, we introduce a two-phase method to fuse the different features. Finally, we propose a novel end-to-end network called WaveMsNet based on the multi-scale convolution operation and the two-phase method. On the environmental sound classification datasets ESC-10 and ESC-50, WaveMsNet achieves classification accuracies of 93.75% and 79.10% respectively, a significant improvement over previous methods.
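The idea of applying filters of several sizes to the same waveform can be sketched in a few lines of plain Python. This is only an illustration of the general multi-scale pattern, not the WaveMsNet architecture: the helper names, the averaging kernels, and the pooling to a fixed number of frames are all assumptions made for this example, and a real model would learn its filters.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1-D cross-correlation of a waveform with one filter."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def max_pool_to(features, frames):
    """Max-pool a sequence down to `frames` values (needs len >= frames)."""
    n = len(features)
    bounds = [i * n // frames for i in range(frames + 1)]
    return [max(features[bounds[i]:bounds[i + 1]]) for i in range(frames)]


def multi_scale_features(signal, kernels, frames=4):
    """Apply filters of different sizes and align outputs on one time axis."""
    return [max_pool_to(conv1d(signal, k), frames) for k in kernels]


if __name__ == "__main__":
    waveform = [float(i) for i in range(16)]
    # Hypothetical short, medium, and long averaging filters.
    kernels = [[0.5] * 2, [0.25] * 4, [0.125] * 8]
    for row in multi_scale_features(waveform, kernels):
        print(row)
```

Short filters keep fine time detail, while long filters cover more samples per output and so resolve lower frequencies better. Pooling every branch to the same number of frames lets the outputs be stacked into one feature map.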

Sensors and Actuators A: Physical

Design, analysis and experiment of a novel ring vibratory gyroscope

Tao

Xiao

et al. 2011

Learning Environmental Sounds with Multi-scale Convolutional Neural Network

Zhu

Wang

Liu

et al. 2018

Preprint

An Adversarial Feature Distillation Method for Audio Classification

Gao

et al. 2019

IEEE Access

Is the Development of China’s Financial Inclusion Sustainable? Evidence from a Perspective of Balance

2018

Balance plays an important role in the sustainable development of China's financial inclusion. First, this paper uses the entropy weight method to construct a financial inclusion index (FII) and measure the level of development of financial inclusion in China's regions. Second, the concept of a Gini coefficient of financial inclusion is proposed and used to assess the structural balance of China's financial inclusion. Third, we use a dynamic shift-share model to further discuss the development balance of financial inclusion across China's regions. The results show that there is an imbalance in the development of financial inclusion in China's regions. For 2006-2016, the Gini coefficient and the structural balance of China's financial inclusion show a significant downward trend. The gap in financial inclusion development between regions is narrowing and the structure of China's financial inclusion is becoming more reasonable. The penetration dimension is at a structural disadvantage. The availability and usage dimensions are at a structural advantage, which can effectively promote the development of China's financial inclusion. In the future, the government should establish a more balanced financial inclusion development mechanism, making full use of the structural advantages of the availability and usage of financial services to promote the sustainable development of China's financial inclusion.
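The two measures named in the abstract above can be illustrated with their textbook forms. The sketch below computes entropy weights for a table of indicators and a standard Gini coefficient over a list of index values. The paper's own "Gini coefficient of financial inclusion" may be defined differently, the function names are hypothetical, and the inputs are assumed to be non-negative, already normalized, and to have positive column sums.

```python
import math


def entropy_weights(rows):
    """Entropy weight method: rows are regions, columns are indicators.

    Indicators that vary more across regions get larger weights.
    Assumes at least one indicator is not constant across regions.
    """
    n = len(rows)
    divergence = []
    for col in zip(*rows):
        total = sum(col)
        probs = [v / total for v in col]
        # Normalized Shannon entropy in [0, 1]; zero shares contribute nothing.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergence.append(1.0 - entropy)
    scale = sum(divergence)
    return [d / scale for d in divergence]


def gini(values):
    """Standard Gini coefficient via mean absolute difference."""
    n = len(values)
    mean = sum(values) / n
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2.0 * n * n * mean)


if __name__ == "__main__":
    # Hypothetical data: two regions, two indicators.
    print(entropy_weights([[1.0, 1.0], [1.0, 3.0]]))
    print(gini([0.0, 0.0, 0.0, 1.0]))
```

A composite index for each region would then be the weighted sum of its indicators, and a falling Gini coefficient over time corresponds to the narrowing regional gap that the abstract reports.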

General audio tagging with ensembling convolutional neural networks and statistical features

Kong

et al. 2019

Audio tagging aims to infer descriptive labels from audio clips. Audio tagging is challenging due to the limited size of data and noisy labels. In this paper, we describe our solution for the DCASE 2018 Task 2 general audio tagging challenge. The contributions of our solution include: (1) we investigated a variety of convolutional neural network architectures to solve the audio tagging task; (2) statistical features are applied to capture statistical patterns of audio features to improve the classification performance; (3) ensemble learning is applied to combine the outputs from the deep classifiers and utilize complementary information; (4) a sample reweighting strategy is employed for ensemble training to address the noisy label problem. Our system achieves a mean average precision (mAP@3) of 0.958, outperforming the baseline system's 0.704. Our system ranked 1st and 4th out of 558 submissions on the public and private leaderboards of the DCASE 2018 Task 2 challenge, respectively. Our code is available at https://github.com/Cocoxili/DCASE2018Task2/.
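For readers unfamiliar with the mAP@3 metric quoted above, here is a minimal sketch of how it is commonly computed when each clip has exactly one true label: a clip scores 1/rank if its label appears among the top three predictions and 0 otherwise, and the scores are averaged. The function name and the toy labels are hypothetical, and this is not taken from the authors' repository.

```python
def map_at_3(true_labels, ranked_predictions):
    """Mean average precision at 3 with one true label per clip."""
    total = 0.0
    for truth, ranked in zip(true_labels, ranked_predictions):
        # Only the three highest-ranked guesses are considered.
        for rank, label in enumerate(ranked[:3], start=1):
            if label == truth:
                total += 1.0 / rank
                break
    return total / len(true_labels)


if __name__ == "__main__":
    truths = ["bark", "flute", "cough"]
    guesses = [
        ["bark", "meow", "cough"],    # correct at rank 1, scores 1
        ["cello", "flute", "oboe"],   # correct at rank 2, scores 1/2
        ["laugh", "sneeze", "bark"],  # not in top 3, scores 0
    ]
    print(map_at_3(truths, guesses))
```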

Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features

Wang

et al. 2018

Motivated by the fact that the characteristics of different sound classes vary widely across temporal scales and hierarchical levels, a novel deep convolutional neural network (CNN) architecture is proposed for the environmental sound classification task. This network architecture takes raw waveforms as input and uses a set of separate parallel CNNs with different convolutional filter sizes and strides to learn feature representations at multiple temporal resolutions. In addition, the proposed architecture aggregates hierarchical features from multi-level CNN layers for classification using direct connections between convolutional layers, going beyond the single-level CNN features employed by the majority of previous studies. This network architecture also improves the flow of information and avoids the vanishing gradient problem. The combination of multi-level features boosts the classification performance significantly. Comparative experiments are conducted on two datasets: the environmental sound classification dataset (ESC-50) and the DCASE 2017 audio scene classification dataset. Results demonstrate that the proposed method, by employing multi-temporal resolution and multi-level features, is highly effective in the classification tasks and outperforms previous methods that only account for single-level features.