2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2012.6287923

Audio event detection from acoustic unit occurrence patterns

Abstract: In most real-world audio recordings, we encounter several types of audio events. In this paper, we develop a technique for detecting signature audio events that is based on identifying patterns of occurrence of automatically learned atomic units of sound, which we call Acoustic Unit Descriptors or AUDs. Experiments show that the methodology works well for detecting individual events and for locating their boundaries in complex recordings.
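
As a rough illustration of the pipeline the abstract describes, the sketch below learns acoustic units without labels, transcribes each clip into a unit sequence, and turns the unit occurrence pattern into a fixed-length descriptor for an event detector. The paper learns AUDs as HMMs through an iterative unsupervised procedure; here k-means over MFCC-like frame features stands in for unit learning and a logistic-regression detector stands in for the event model, so all names and choices below are illustrative rather than the authors' implementation.

```python
# Minimal sketch of an AUD-style detection pipeline (illustrative only).
# Assumes each clip is already an (n_frames x n_dims) array of MFCC-like
# features; k-means replaces the paper's HMM-based unit learning.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

N_UNITS = 64  # number of acoustic units (AUDs); a free parameter

def learn_units(all_frames):
    """Learn acoustic units from unlabeled frames pooled over the training clips."""
    return KMeans(n_clusters=N_UNITS, n_init=10, random_state=0).fit(all_frames)

def transcribe(units, clip_frames):
    """Map a clip's frames to a sequence of unit indices (its AUD 'transcript')."""
    return units.predict(clip_frames)

def occurrence_pattern(transcript):
    """Normalized histogram of unit occurrences: the fixed-length clip descriptor."""
    counts = np.bincount(transcript, minlength=N_UNITS).astype(float)
    return counts / max(counts.sum(), 1.0)

# Usage sketch: `train_clips` is a list of (n_frames x n_dims) arrays and
# `labels` marks whether each clip contains the target event.
# units = learn_units(np.vstack(train_clips))
# X = np.array([occurrence_pattern(transcribe(units, c)) for c in train_clips])
# detector = LogisticRegression(max_iter=1000).fit(X, labels)
```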

Cited by 35 publications (30 citation statements) · References 10 publications

“…Number of codebooks and words per codebook: Although increasing the number of codebooks (N ∈ [1,3,9]) or the number of words extracted per codebook (K ∈ [1,3]) increases the number of combinations and should have the same effect as increasing the size of the codebooks, we observe that recall tends to increase while precision drops slightly. We believe that using several words per codebook, i.e., K = 3, drastically improves the description capabilities of audio words.…”
Section: B. Study on the Parameters (mentioning)
confidence: 87%
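
The multi-codebook setup the citing authors study (N codebooks, K words per codebook) can be sketched as follows: each frame votes for its K nearest words in every codebook, and the votes are pooled into one bag-of-audio-words histogram. This is a minimal reading of their description, assuming k-means codebooks over MFCC-like frames; the function names and subsampling details are hypothetical.

```python
# Sketch of a multi-codebook bag-of-audio-words descriptor (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(frames, n_codebooks=3, codebook_size=256, seed=0):
    """Train several codebooks, each on a different random subset of the frames."""
    rng = np.random.default_rng(seed)
    books = []
    for i in range(n_codebooks):
        idx = rng.choice(len(frames), size=min(len(frames), 20_000), replace=False)
        books.append(KMeans(n_clusters=codebook_size, n_init=5, random_state=i).fit(frames[idx]))
    return books

def bag_of_audio_words(clip_frames, books, k_words=3):
    """Each frame contributes its k_words nearest words in every codebook."""
    hist = np.zeros(sum(b.n_clusters for b in books))
    offset = 0
    for book in books:
        dists = book.transform(clip_frames)                # (n_frames, codebook_size) distances
        nearest = np.argsort(dists, axis=1)[:, :k_words]   # K closest words per frame
        np.add.at(hist, offset + nearest.ravel(), 1.0)
        offset += book.n_clusters
    return hist / max(hist.sum(), 1.0)
```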
“…The approach followed in [9] is the closest to ours. In this article, the 2011 TRECVID Multimedia Event Detection (MED) data [10] is used to characterize user-generated video excerpts coming from the Internet and to detect audio events for which annotations are provided.…”
Section: Background on Audio Words (mentioning)
confidence: 99%
“…It is a general framework for obtaining fixed-length representations of audio clips and can be applied to a variety of low-level audio features such as MFCCs [24], autoencoder-based features [3] and normalized spectral features [21], to name a few. An alternate approach to obtaining bags of words is used in [18]: sound recordings are first decomposed into a sequence of basic sound units called "Acoustic Unit Descriptors" (AUDs), which are themselves learned in an unsupervised manner. Bags of words are then obtained as bags of AUDs.…”
Section: Related Work (mentioning)
confidence: 99%
“…Ensemble-based learning approaches such as random forests and density forests [6] have been successfully employed for several tasks in the audio domain, such as emotion recognition [19,18], paralinguistic event detection [1] and audio event detection [11]. In this work, a segmentation forest is utilised as a special case of the random forest approach.…”
Section: BIC-based Speaker Segmentation Using Segmentation Forest (mentioning)
confidence: 99%
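
For the forest-based detectors mentioned in this last statement, a minimal example would be to train an off-the-shelf random forest on the same occurrence-pattern descriptors sketched earlier. This is only a generic stand-in, not the segmentation-forest method of the citing paper; the function names are hypothetical.

```python
# Generic random-forest event detector over bag-of-units descriptors
# (an off-the-shelf stand-in, not the cited segmentation-forest method).
from sklearn.ensemble import RandomForestClassifier

def train_forest_detector(descriptors, labels, n_trees=200):
    """Fit a random forest on fixed-length clip descriptors (e.g. AUD histograms)."""
    return RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(descriptors, labels)

def event_scores(forest, descriptors):
    """Per-clip posterior probability of the target event."""
    return forest.predict_proba(descriptors)[:, 1]
```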