Multi-View Audio And Music Classification

Phan, Huy; Nguyễn, Huy L.; Chén, Oliver Y.; Pham, Lam; Koch, Philipp; McLoughlin, Ian; Mertins, Alfred

doi:10.1109/icassp39728.2021.9414551

Cited by 17 publications

(5 citation statements)

References 21 publications

(30 reference statements)

“…As using ensemble is a rule of thumb to improve the ASC performance and shows effective to deal with the issue of mismatched recording devices [50], [16], [17], [51], [52], [53], [54], we therefore apply an ensemble of multiple spectrogram inputs in this paper. In particular, we use three spectrograms: log-Mel [36], Gammatone (Gam) [55], and Constant Q Transform (CQT) [36].…”

Section: B Further Improve ASC Performance By An Ensemble Of Multiple... (mentioning)

confidence: 99%

Robust, General, and Low Complexity Acoustic Scene Classification Systems and An Effective Visualization for Presenting a Sound Scene Context

Pham, Salovic, Jalali et al. 2022

Preprint

In this paper, we present a comprehensive analysis of Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording from its acoustic signature. In particular, we first propose an inception-based, low-footprint ASC model, referred to as the ASC baseline. The proposed ASC baseline is then compared with benchmark and high-complexity network architectures of MobileNetV1, …

“…In our experimentation, we only used the text and audio modalities. We extracted two views (low-level features) from the audio modality: the raw audio signal (Raw) and the Mel-scale spectrogram (MEL), as suggested in [26].…”

Section: Data Set Descriptions (mentioning)

confidence: 99%
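The two audio views named in the statement above (raw signal and Mel-scale spectrogram) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration only: the helper names (`mel_filterbank`, `log_mel`), the frame length, hop size, number of Mel bands, and the synthetic 440 Hz tone are not taken from the citing paper, which does not state its extraction settings here.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style Mel conversion (an assumed choice).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale, shape (n_mels, n_fft//2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel(signal, sr, n_fft=512, hop=256, n_mels=40):
    """Log-Mel spectrogram of a 1-D signal, shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)  # small offset avoids log(0)

# Hypothetical input: one second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
raw_view = np.sin(2 * np.pi * 440 * t)   # view 1: raw waveform
mel_view = log_mel(raw_view, sr)         # view 2: log-Mel spectrogram
```

Both arrays describe the same recording, so a multi-view method can treat them as two feature matrices over the same set of objects.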

Multi-modal Multi-view Clustering based on Non-negative Matrix Factorization

Khalafaoui, Grozavu, Mateï et al. 2022

2022 IEEE Symposium Series on Computational Intelligence (SSCI)

By combining related objects, unsupervised machine learning techniques aim to reveal the underlying patterns in a data set. Non-negative Matrix Factorization (NMF) is a data mining technique that factorizes a data matrix, under non-negativity constraints on its elements, into two matrices: one representing the data partitions and the other representing the cluster prototypes of the data set. This method has attracted a lot of attention and is used in a wide range of applications, including text mining, clustering, language modeling, music transcription, and neuroscience (gene separation). The absence of negative values makes the generated matrices simpler to interpret. In this article, we propose a study on multi-modal clustering algorithms and present a novel method called multi-modal multi-view non-negative matrix factorization, in which we analyze the collaboration of several local NMF models. The experimental results show the value of the proposed approach, which was evaluated using a variety of data sets, and the obtained results are very promising compared to state-of-the-art methods.
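As a rough illustration of the factorization the abstract describes, the sketch below runs plain single-view NMF with the standard multiplicative updates for the Frobenius objective. It is not the multi-modal multi-view method of the paper; the function name `nmf`, the toy matrix, the rank, and the iteration count are all assumptions chosen for the example.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative X (n x m) into W (n x rank) and H (rank x m).

    W plays the role of the partition matrix (one row per object),
    H the role of the cluster prototypes (one row per cluster).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: four objects forming two obvious blocks.
X = np.array([[5.0, 4.0, 0.0, 0.0],
              [4.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 4.0],
              [0.0, 0.0, 4.0, 3.0]])
W, H = nmf(X, rank=2)
labels = W.argmax(axis=1)            # hard cluster assignment per object
error = np.linalg.norm(X - W @ H)    # Frobenius reconstruction error
```

Reading cluster labels from the largest entry in each row of `W` is one common convention; other normalizations are possible.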

“…As applying an ensemble of either different types of input spectrograms [14], [15], [16], [17] or different learning models [18], [19], [20], [21], [22] has been a rule of thumb to enhance the performance of audio-based scene classification task performance, we therefore evaluate two ensemble methods, referred to as the multiple spectrogram strategy (e.g. Multiple spectrograms combines with one model) and the multiple model strategy (e.g.…”

Section: B Further Exploring Audio-based Framework (mentioning)

confidence: 99%

An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification

Pham, Ngo, Nguyen et al. 2021

Preprint

Self Cite

This paper presents a task of audio-visual scene classification (SC) where input videos are classified into one of five real-life crowded scenes: 'Riot', 'Noise-Street', 'Firework-Event', 'Music-Event', and 'Sport-Atmosphere'. To this end, we first collect an audio-visual dataset (videos) of these five crowded contexts from YouTube (in-the-wild scenes). Then, a wide range of deep learning frameworks are proposed that use either audio or visual input data independently. Finally, results obtained from the high-performing deep learning frameworks are fused to achieve the best accuracy score. Our experimental results indicate that audio and visual input factors independently contribute to the SC task's performance. Significantly, an ensemble of deep learning frameworks exploring either audio or visual input data can achieve the best accuracy of 95.7%.
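The fusion step mentioned in the abstract can be sketched as a simple late fusion of per-class probabilities. The abstract does not say which fusion rule the authors use, so averaging is an assumption here, and the function name `mean_fusion` and both probability vectors are hypothetical values made up for the example.

```python
def mean_fusion(prob_sets):
    """Average per-class probabilities from several models (late fusion)."""
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    return [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]

# Class names come from the abstract; the scores are invented.
classes = ["Riot", "Noise-Street", "Firework-Event",
           "Music-Event", "Sport-Atmosphere"]
audio_probs = [0.10, 0.20, 0.10, 0.50, 0.10]    # hypothetical audio model output
visual_probs = [0.05, 0.15, 0.10, 0.30, 0.40]   # hypothetical visual model output

fused = mean_fusion([audio_probs, visual_probs])
predicted = classes[max(range(len(fused)), key=fused.__getitem__)]
```

In this toy case the visual model alone would pick 'Sport-Atmosphere', while the fused scores favor 'Music-Event', which shows how the two inputs can correct each other.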
