As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show state-of-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks. Our focus is on the particular advantage of DMNs under low-resource conditions with limited transcribed speech. We extend DMNs to hybrid and bottleneck feature systems, and explore optimal network structures (number of maxout layers, pooling strategy, etc.) for both setups. On the newly released Babel corpus, behaviors of DMNs are extensively studied under different levels of data availability. Experiments show that DMNs improve low-resource speech recognition significantly. Moreover, DMNs introduce sparsity to their hidden activations and thus can act as sparse feature extractors.
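As a rough illustration of the maxout idea referenced above, the sketch below (variable names and shapes are ours, not from the paper) computes one hidden layer in which each unit takes the maximum over k linear pieces; this max pooling over pieces is what leaves many pieces inactive and yields the sparse-looking activations the abstract mentions.

```python
# Minimal sketch of a single maxout hidden layer (hypothetical shapes/names).
import numpy as np

def maxout_layer(x, W, b):
    """x: (n_in,), W: (n_out, k, n_in), b: (n_out, k).
    Each hidden unit outputs the max over its k linear pieces."""
    z = np.einsum('oki,i->ok', W, x) + b   # (n_out, k) linear pieces
    return z.max(axis=1)                   # pool: one activation per unit

# Example: 40-dim input, 10 maxout units, pool size k = 3
rng = np.random.default_rng(0)
x = rng.standard_normal(40)
W = rng.standard_normal((10, 3, 40)) * 0.1
b = np.zeros((10, 3))
h = maxout_layer(x, W, b)   # shape (10,)
```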
Multimedia Event Detection (MED) is an annual task in the NIST TRECVID evaluation, and requires participants to build indexing and retrieval systems for locating videos in which certain predefined events are shown. Typical systems focus heavily on the use of visual data. Audio data, however, also contains rich information that can be effectively used for video retrieval, and MED could benefit from the attention of researchers in audio analysis. We present several systems for performing MED using only audio data, report the results of each system on the TRECVID MED 2011 development dataset, and compare the strengths and weaknesses of each approach.
In this paper, we present recent experiments on using Artificial Neural Networks (ANNs), a new "delayed" approach to speech vs. non-speech segmentation, and extraction of large-scale pooling features (LSPF) for detecting "events" within consumer videos, using the audio channel only. An "event" is defined as a sequence of observations in a video that can be directly observed or inferred. Ground truth is given by a semantic description of the event, and by a number of example videos. We describe and compare several algorithmic approaches, and report results on the 2013 TRECVID Multimedia Event Detection (MED) task, using arguably the largest such research set currently available. The presented system achieved the best results in most audio-only conditions. While the overall finding is that MFCC features perform best, we find that ANN as well as LSPF features provide complementary information at various levels of temporal resolution. This paper provides analysis of both low-level and high-level features, investigating their relative contributions to overall system performance.
Index Terms: acoustic event detection, computational acoustic scene analysis, multimedia retrieval
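The snippet below is only a hedged sketch of what pooling frame-level features over large temporal windows could look like; the window sizes, the use of MFCC-like frame features, and the mean/max statistics are assumptions for illustration, not the paper's exact LSPF recipe.

```python
# Illustrative temporal pooling of frame-level features into a clip descriptor.
import numpy as np

def pooled_features(frames, window_sizes=(50, 200, 1000)):
    """frames: (T, d) matrix of frame-level features (e.g., MFCCs).
    For each window size, average within non-overlapping windows, then take
    per-dimension mean and max across windows; concatenate everything."""
    descriptors = []
    for w in window_sizes:
        n = max(len(frames) // w, 1)
        chunks = np.array_split(frames[:n * w], n)           # windows of ~w frames
        pooled = np.stack([c.mean(axis=0) for c in chunks])  # (n, d)
        descriptors.append(pooled.mean(axis=0))               # mean over windows
        descriptors.append(pooled.max(axis=0))                # max over windows
    return np.concatenate(descriptors)
```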
In this paper we present our audio-based system for detecting "events" within consumer videos (e.g., YouTube) and report our experiments on the TRECVID Multimedia Event Detection (MED) task and development data. Codebook or bag-of-words models have been widely used in text, visual and audio domains and form the state-of-the-art in MED tasks. The overall effectiveness of these models on such datasets depends critically on the choice of low-level features, clustering approach, sampling method, codebook size, weighting schemes and choice of classifier. In this work we empirically evaluate several approaches to model expressive and robust audio codebooks for the task of MED while ensuring compactness. First, we introduce the Large Scale Pooling Features (LSPF) and Stacked Cepstral Features for encoding local temporal information in audio codebooks. Second, we discuss several design decisions for generating and representing expressive audio codebooks and show how they scale to large datasets. Third, we apply text-based techniques like Latent Dirichlet Allocation (LDA) to learn acoustic topics as a means of providing compact representation while maintaining performance. By aggregating these decisions into our model, we obtained an 11% relative improvement over our baseline audio systems.
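A bag-of-audio-words pipeline of the kind described can be sketched as follows; the codebook size, k-means clustering, and scikit-learn API are illustrative assumptions rather than the authors' exact design choices.

```python
# Minimal bag-of-audio-words sketch (illustrative parameters, assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(all_frames, codebook_size=1024):
    """Cluster frame-level features (e.g., MFCCs) from many clips into a codebook."""
    return KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(all_frames)

def encode_clip(frames, codebook):
    """Quantize each frame to its nearest codeword and build a normalized
    histogram: the clip-level bag-of-audio-words representation."""
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Such histograms could then be fed to a topic model (e.g., sklearn.decomposition.LatentDirichletAllocation) to obtain a compact acoustic-topic mixture per clip, in the spirit of the LDA step mentioned above.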
Audio semantic concepts (sound events) play an important role in audio-based content analysis. Capturing semantic information effectively from the complex occurrence patterns of sound events in YouTube-quality videos is a challenging problem. This paper presents a novel framework for semantic information extraction in such real-world videos and evaluates it through the NIST Multimedia Event Detection (MED) task. We calculate the occurrence confidence matrix of sound events and explore multiple strategies to generate clip-level semantic features from the matrix. We evaluate performance using the TRECVID 2011 MED dataset. The proposed method outperforms a previous HMM-based system. A late-fusion experiment with low-level features and a text feature (ASR) shows that audio semantic concepts capture complementary information in the soundtrack.
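As a hedged sketch of the last step described above, the function below pools a segment-by-event confidence matrix into a single clip-level semantic feature vector; the specific pooling strategies shown (max, mean, thresholded fraction) are common choices that merely stand in for the strategies the paper explores.

```python
# Turn a (segments x events) confidence matrix into clip-level semantic features.
import numpy as np

def clip_semantic_features(conf, threshold=0.5):
    """conf: (n_segments, n_events) matrix of per-segment event confidences.
    Returns one fixed-length vector per clip by pooling over segments."""
    max_pool  = conf.max(axis=0)                     # strongest evidence per event
    mean_pool = conf.mean(axis=0)                    # average presence per event
    frac_pool = (conf > threshold).mean(axis=0)      # fraction of segments above threshold
    return np.concatenate([max_pool, mean_pool, frac_pool])
```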