Convolutional Neural Networks Using Log Mel-Spectrogram Separation for Audio Event Classification with Unknown Devices

Seo, Soonshin; Kim, Changmin; Kim, Ji‐Hwan

doi:10.13052/jwe1540-9589.21216

Cited by 7 publications (6 citation statements)

References 13 publications (18 reference statements)

“…In this experiment, vanilla MobileNet v2 was less accurate than VGG-Resnet, although the MobileNet v2 model performed better after data augmentation. VGG-Resnet had a comparable performance in the audio event classification domain [32]. MobileNet v2 with time and frequency masking was considerably more accurate than the VGG-Resnet model.…”

Section: Results (mentioning)

confidence: 94%
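
The citation statement above credits time and frequency masking for the accuracy gain but gives no parameters. As a rough illustration only, the sketch below assumes a SpecAugment-style scheme with one mask per axis, filled with the spectrogram mean. The function name `mask_time_and_frequency` and the default widths are hypothetical and are not taken from the cited paper.

```python
import numpy as np


def mask_time_and_frequency(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Return a copy of `spec` (n_freq, n_time) with one frequency band and
    one time span overwritten by the mean value.

    Widths and fill value are illustrative assumptions, not the cited setup.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_freq, n_time = out.shape
    fill = out.mean()

    # Frequency mask: a horizontal band of random height and position.
    f_width = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    out[f_start:f_start + f_width, :] = fill

    # Time mask: a vertical stripe of random width and position.
    t_width = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    out[:, t_start:t_start + t_width] = fill

    return out
```

Masking is typically applied on the fly during training so that each epoch sees a differently masked copy of every spectrogram, while the stored input is left unchanged.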

“…There were five versions of the AEC model: The results were compared with those for VGG-Resnet, which is an ensemble version of a CNN-based VGG network (VGGnet) [32] and a residual network [33].…”

Section: Results (mentioning)

confidence: 99%

“…CNNs are widely used to model audio events from audio feature vectors. This type of neural network was mainly used to transform acoustic feature vectors into spectrograms and then to train them [32]. The correlations between local information and feature vectors are learned by the CNN.…”

Section: Audio Event and Audio Scene Classification (mentioning)

confidence: 99%

Deep Neural Networks-based Classification Methodologies of Speech, Audio and Music, and its Integration for Audio Metadata Tagging

Park, Chung, Kim. 2023. JWE

Videos contain visual and auditory information. Visual information in a video can include images of people, objects, and the landscape, whereas auditory information includes voices, sound effects, background music, and the soundscape. The audio content can provide detailed information on the story by conducting a voice and atmosphere analysis of the sound effects and soundscape. Metadata tags represent the results of a media analysis as text. The tags can classify video content on social networking services, like YouTube. This paper presents the methodologies of speech, audio, and music processing. Also, we propose integrating these audio tagging methods and applying them in an audio metadata generation system for video storytelling. The proposed system automatically creates metadata tags based on speech, sound effects, and background music information from the audio input. The proposed system comprises five subsystems: (1) automatic speech recognition, which generates text from the linguistic sounds in the audio, (2) audio event classification for the type of sound effect, (3) audio scene classification for the type of place from the soundscape, (4) music detection for the background music, and (5) keyword extraction from the automatic speech recognition results. First, the audio signal is converted into a suitable form, which is subsequently combined from each subsystem to create metadata for the audio content. We evaluated the proposed system using video logs (vlogs) on YouTube. The proposed system exhibits a similar accuracy to handcrafted metadata for the audio content, and for a total of 104 YouTube vlogs, achieves an accuracy of 65.83%.

“…The problem caused by differences in frequency‐domain emphasis by various audio input devices was addressed in this study's audio preprocessing step using a log‐Mel spectrogram because it can provide more accurate and detailed characteristics in the high‐ and low‐frequency domains than the Mel‐spectrogram [48]. In addition, log‐Mel spectrograms can improve performance, as demonstrated by the DCASE 2020 challenge for audio scene classification [48]. At a sampling rate of 16000, each 3‐s sample was processed to create a single log‐Mel time‐frequency spectrogram.…”

Section: Methods (mentioning)

confidence: 99%
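
The statement above gives only the sampling rate (16000 Hz) and clip length (3 s). As a minimal sketch of how one such clip could be turned into a single log-Mel spectrogram, the code below assumes a 25 ms Hann window, a 10 ms hop, a 512-point FFT, and 64 Mel bands. Those values and all function names are assumptions for illustration, not the citing paper's configuration.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, center, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(signal, sr=16000, n_fft=512, win=400, hop=160, n_mels=64):
    """Return a (n_mels, n_frames) log-Mel spectrogram of a 1-D signal."""
    # Slice the signal into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(win)
    # Power spectrum per frame, projected onto the Mel filterbank.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    # A small offset keeps the logarithm finite for silent bands.
    return np.log(mel + 1e-10).T
```

With these assumed settings, a 3 s clip at 16000 Hz (48000 samples) yields 298 frames, so the result is a 64 x 298 array that can be fed to a CNN as a single-channel image.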

Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems

Jeon, Lee, Yeo et al. 2024. ETRI Journal

Exposure to varied noisy environments impairs the recognition performance of artificial intelligence‐based speech recognition technologies. Degraded‐performance services can be utilized as limited systems that assure good performance in certain environments, but impair the general quality of speech recognition services. This study introduces an audiovisual speech recognition (AVSR) model robust to various noise settings, mimicking human dialogue recognition elements. The model converts word embeddings and log‐Mel spectrograms into feature vectors for audio recognition. A dense spatial–temporal convolutional neural network model extracts features from log‐Mel spectrograms, transformed for visual‐based recognition. This approach exhibits improved aural and visual recognition capabilities. We assess the signal‐to‐noise ratio in nine synthesized noise environments, with the proposed model exhibiting lower average error rates. The error rate for the AVSR model using a three‐feature multi‐fusion method is 1.711%, compared to the general 3.939% rate. This model is applicable in noise‐affected environments owing to its enhanced stability and recognition rate.

“…CNNs excel in classifying audio signals across a spectrum of categories, encompassing speech, music, and environmental sounds. Their proficiency extends to tasks such as speech recognition, speaker identification, and even emotion recognition [18]-[22]. On the other hand, RNNs demonstrate prowess in audio classification and segmentation, effectively disassembling and categorizing audio data with remarkable accuracy [23]-[27].…”

Section: Introduction (mentioning)

confidence: 99%

An Efficient Approach for Securing Audio Data in AI Training with Fully Homomorphic Encryption

Nguyen, Phan, Zhang et al. 2024. Preprint
