Proceedings of the 2nd ACM International Workshop on Multimedia Databases 2004
DOI: 10.1145/1032604.1032620

Automatic classification of speech and music using neural networks

Abstract: The importance of automatic discrimination between speech signals and music signals has evolved as a research topic over recent years. The need to classify audio into categories such as speech or music is an important aspect of many multimedia document retrieval systems. Several approaches have been previously used to discriminate between speech and music data. In this paper, we propose the use of the mean and variance of the discrete wavelet transform in addition to other features that have been used previous…

Cited by 13 publications (7 citation statements)
References 10 publications (9 reference statements)
“…LPC have also been applied in audio segmentation and general-purpose audio retrieval, as in the works by Khan et al. [68,69].…”
Section: Autoregression-based Frequency Features (mentioning)
confidence: 99%
“…It measures how quickly the power spectrum changes, and it can be used to determine the timbre of an audio signal. This feature has been used for speech/music discrimination (as in Jiang et al. [60] or Khan et al. [68,69]), musical instrument classification (Benetos et al. [10]), music genre classification (Li et al. [40], Lu et al. [12], Tzanetakis and Cook [28], Wang et al. [9]) and environmental sound recognition (see Peltonen et al. [18]). • Spectral peaks: this feature was defined by Wang [8] as constellation maps that show the most relevant energy bin components in the time-frequency signal representation.…”
Section: STFT-based Frequency Features (mentioning)
confidence: 99%
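The quantity described in the statement above — how quickly the power spectrum changes between frames — is commonly computed as spectral flux. A minimal sketch, assuming numpy; the function name, frame length, hop size, and per-frame normalization are illustrative choices, not parameters taken from any of the cited works:

```python
import numpy as np

def spectral_flux(signal, frame_len=1024, hop=512):
    """Frame-wise spectral flux: squared change of the (normalized)
    magnitude spectrum between consecutive frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    # Normalize each spectrum so flux reflects spectral-shape change,
    # not overall loudness.
    mags /= np.maximum(mags.sum(axis=1, keepdims=True), 1e-12)
    return np.sum(np.diff(mags, axis=0) ** 2, axis=1)
```

Speech, with its alternation of voiced and unvoiced segments, tends to produce a more variable flux sequence than steady music, which is why summary statistics of this curve are useful discriminators.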
“…Earlier, Khan et al. [22] proposed the wavelet parameterization for speech/music detection, but used only two values per frame to perform speech/music classification: the mean and the variance of the discrete wavelet transform coefficients.…”
Section: Accepted Manuscript (mentioning)
confidence: 99%
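The two-value-per-frame parameterization described in this statement can be sketched as follows. A one-level Haar transform is used here as a self-contained stand-in; this is an assumption, since the statement does not say which wavelet or decomposition depth the original work used:

```python
import numpy as np

def haar_dwt(frame):
    """One-level Haar discrete wavelet transform of an even-length
    frame: approximation coefficients followed by detail coefficients."""
    even, odd = frame[0::2], frame[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return np.concatenate([approx, detail])

def dwt_mean_var(frame):
    """The two per-frame features described above: mean and variance
    of the DWT coefficients."""
    coeffs = haar_dwt(np.asarray(frame, dtype=float))
    return coeffs.mean(), coeffs.var()
```

Feeding just these two numbers per frame to a classifier keeps the feature vector very small, which is the point the citing author is making.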
“…Nevertheless, some systems use other speech/music classifiers, such as Multi-Layer Perceptron [22], [24], Maximum A Posteriori classifier [42], k-Nearest Neighbors [42], and different hybrid systems: MLP/SVM (Support Vector Machine) [14], MLP/HMM [1].…”
Section: Introduction (mentioning)
confidence: 99%
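At classification time, a Multi-Layer Perceptron of the kind listed above reduces to a forward pass from a feature vector to class scores. A minimal numpy sketch with a sigmoid hidden layer and a softmax over {speech, music}; the architecture and the assumption of pre-trained weights are illustrative, not the configuration of any cited system:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Single-hidden-layer perceptron forward pass:
    feature vector -> probability over {speech, music}."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # sigmoid hidden layer
    z = W2 @ h + b2                           # output logits
    e = np.exp(z - z.max())                   # numerically stable softmax
    return e / e.sum()
```

In practice the weights W1, b1, W2, b2 would be learned by backpropagation on labeled speech/music frames; hybrid systems such as MLP/HMM then smooth these frame-level posteriors over time.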
“…Although they are relatively simple to calculate, they can be representative of the feature sequence. In addition to the mean and variance, which are of high importance (see [30,31]), we also make use of three percentiles. These reflect the value below which a certain percentage of observations falls.…”
Section: Computation of Short-term Statistics (mentioning)
confidence: 99%
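The short-term statistics described above can be sketched in a few lines of numpy. The choice of the 25th, 50th, and 75th percentiles is an assumption for illustration; the statement says three percentiles are used but does not name them:

```python
import numpy as np

def short_term_statistics(feature_seq, percentiles=(25, 50, 75)):
    """Summarize a per-frame feature sequence with the statistics
    discussed above: mean, variance, and a few percentiles."""
    seq = np.asarray(feature_seq, dtype=float)
    stats = [seq.mean(), seq.var()]
    stats.extend(np.percentile(seq, p) for p in percentiles)
    return np.array(stats)
```

Applying this to, e.g., a per-frame spectral-flux or DWT-coefficient sequence turns a variable-length sequence into a fixed-length vector suitable for any of the classifiers mentioned earlier.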