“…In addition, the common hand-crafted features used for acoustic scene classification (or clustering) include log mel-band energies, mel-frequency cepstral coefficients (MFCCs), spectral flux, spectrograms, Gabor filterbanks, cochleograms, i-vectors, histogram of gradients features [12]-[15], the histogram of gradients of time-frequency representations (HGTR) [14], hash features [16], and local binary patterns [17], [18]. In recent years, transformed features obtained by matrix factorization [19], [20] and deep neural networks [6], [11], [21] have been used to address the lack of flexibility of hand-crafted features. Hand-crafted or shallow features did not effectively represent the differences in properties among the various classes of acoustic scenes, and thus their performance was inferior to that of deep transformed features learned by deep neural networks, such as convolutional neural networks (CNNs) [11], [22]-[25].…”