Interspeech 2013
DOI: 10.21437/interspeech.2013-654
Robust audio-codebooks for large-scale event detection in consumer videos

Abstract: In this paper we present our audio-based system for detecting "events" within consumer videos (e.g., YouTube) and report our experiments on the TRECVID Multimedia Event Detection (MED) task and development data. Codebook or bag-of-words models have been widely used in the text, visual, and audio domains and form the state of the art in MED tasks. The overall effectiveness of these models on such datasets depends critically on the choice of low-level features, clustering approach, sampling method, codebook size, wei…

Cited by 15 publications (6 citation statements)
References 17 publications
“…Exponential Chi-square (χ²) kernels of the form exp(−γ d(x, y)), where d(x, y) is the χ² distance, have been known to work remarkably well with histogram features, including for the detection of acoustic concepts [34] [35].…”
Section: Experiments and Results
confidence: 99%
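The exponential χ² kernel quoted above can be sketched in a few lines of NumPy. This is a minimal illustration, not the cited papers' implementation; the bandwidth `gamma` and the small `eps` smoothing constant are assumptions one would tune or adjust in practice:

```python
import numpy as np

def chi2_distance(x, y, eps=1e-10):
    # Symmetric chi-squared distance between two histograms:
    # d(x, y) = sum_i (x_i - y_i)^2 / (x_i + y_i)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - y) ** 2 / (x + y + eps))

def exp_chi2_kernel(X, Y, gamma=1.0):
    # K(x, y) = exp(-gamma * d(x, y)) between all rows of X and Y,
    # e.g. for use as a precomputed SVM kernel over bag-of-words histograms.
    D = np.array([[chi2_distance(x, y) for y in Y] for x in X])
    return np.exp(-gamma * D)
```

Because d(x, x) = 0, the kernel of a histogram with itself is always 1, and all kernel values lie in (0, 1].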
“…Pancoast and Akbacak used k-means in their original study [16]; however, due to the large number of frames to be clustered, the runtime of this approach is very high. Rawat et al. propose simple random sampling [19]; its runtime is marginally better than that of k-means, and performance is not noticeably affected. Later, Arthur et al. applied k-means++ clustering [1], a cluster-centre initialisation procedure used instead of completely random sampling, so that the distribution of cluster centres became more balanced.…”
Section: Parameters of the BoAW Methods
confidence: 99%
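The k-means++ seeding contrasted with random sampling above can be sketched as follows. This is only the initialisation step (no Lloyd iterations), which is the part the excerpt says was used in place of random sampling; the function name is illustrative:

```python
import numpy as np

def kmeanspp_init(frames, k, seed=None):
    # k-means++ seeding: pick the first centre uniformly at random,
    # then pick each subsequent centre with probability proportional
    # to its squared distance from the nearest centre chosen so far.
    rng = np.random.default_rng(seed)
    centres = [frames[rng.integers(len(frames))]]
    for _ in range(k - 1):
        # Squared distance of every frame to its nearest current centre.
        d2 = np.min(
            [np.sum((frames - c) ** 2, axis=1) for c in centres], axis=0
        )
        probs = d2 / d2.sum()
        centres.append(frames[rng.choice(len(frames), p=probs)])
    return np.stack(centres)
```

Frames already chosen have zero distance (hence zero selection probability), so the centres spread out over the data rather than clumping, which is why the resulting codebook is "more balanced" than one drawn by uniform random sampling.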
“…In the BoAW approach, the numerical LLDs, or alternatively higher-level features derived from the SnS data, first undergo a vector quantisation (VQ) step, which employs a codebook of template LLDs previously learnt from a certain number of training data [74]. For generating the codebook, Schmitt et al. and their followers used the initialisation step of k-means++ clustering [104], which is comparable to an optimised random sampling of LLDs [105], instead of the traditional k-means clustering method [106], [107]; this improves computational speed while guaranteeing comparable performance. To improve the robustness of this approach, each LLD is assigned to the N_a (assignment number) words with the lowest Euclidean distance, instead of only the single most similar word in the codebook.…”
Section: B. Higher Representations
confidence: 99%
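The multi-assignment VQ step described above can be sketched like this. A minimal illustration with hypothetical names; the final normalisation is an assumed post-processing step so that clips of different lengths yield comparable histograms, not something stated in the excerpt:

```python
import numpy as np

def boaw_histogram(llds, codebook, n_assign=4):
    # Multi-assignment vector quantisation: each LLD frame votes for the
    # n_assign codebook words with the smallest Euclidean distance,
    # instead of only its single nearest word (n_assign=1 would recover
    # plain hard assignment).
    hist = np.zeros(len(codebook))
    for frame in llds:
        dists = np.linalg.norm(codebook - frame, axis=1)
        for idx in np.argsort(dists)[:n_assign]:
            hist[idx] += 1
    # Normalise to a distribution (assumed step for length invariance).
    return hist / hist.sum()
```

Spreading each frame over its N_a nearest words softens quantisation errors near codeword boundaries, which is the robustness gain the excerpt refers to.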