We present the CP-JKU submission to MediaEval 2019: a receptive-field (RF)-regularized and frequency-aware CNN approach for tagging music with emotion/mood labels. We investigate the impact of the RF of the CNNs on their performance on this dataset. We observe that ResNets with smaller receptive fields, originally adapted for acoustic scene classification, also perform well on the emotion tagging task. We further improve the performance of these architectures using frequency awareness and Shake-Shake regularization, techniques used in previous work on general acoustic recognition tasks. The source code is published at https
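As a minimal sketch of the frequency-awareness idea referenced above (not the authors' exact implementation): a convolution over a spectrogram is translation-invariant along the frequency axis, so one common remedy is to concatenate a frequency-coordinate channel to the input, letting filters condition on where along the frequency axis they operate. The class name and tensor shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrequencyAwareConv2d(nn.Module):
    """Conv layer that appends a frequency-coordinate channel before
    convolving, so filters can condition on absolute frequency position.
    Hypothetical sketch, not the CP-JKU reference implementation."""
    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        # +1 input channel for the frequency coordinate map
        self.conv = nn.Conv2d(in_channels + 1, out_channels, kernel_size, **kwargs)

    def forward(self, x):
        # x: (batch, channels, freq_bins, time_frames)
        b, _, f, t = x.shape
        # linear ramp in [-1, 1] over the frequency axis, broadcast over time
        freq_coords = torch.linspace(-1.0, 1.0, f, device=x.device)
        freq_map = freq_coords.view(1, 1, f, 1).expand(b, 1, f, t)
        return self.conv(torch.cat([x, freq_map], dim=1))

# usage: drop-in replacement for nn.Conv2d on a log-mel spectrogram
layer = FrequencyAwareConv2d(1, 32, kernel_size=3, padding=1)
out = layer(torch.randn(4, 1, 128, 256))  # -> (4, 32, 128, 256)
```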
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy, making comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. It comprises about 3.65 million human-labeled frames, or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection and analyze its performance, demonstrating both its strength and the contributions of the dataset.
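The abstract does not detail the model, but the general shape of such audio-visual approaches is a two-tower network: embed the face-crop sequence and the synchronized audio separately, fuse the embeddings, and predict a speaking/not-speaking score per face track. The sketch below shows that generic pattern; the architecture, class name, and shapes are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class AVActiveSpeakerNet(nn.Module):
    """Generic two-tower audio-visual classifier: embed the face-crop
    clip and the audio window separately, fuse, and emit a speaking
    logit. Illustrative sketch only."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.visual = nn.Sequential(  # face crops -> embedding
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, emb_dim))
        self.audio = nn.Sequential(   # log-mel patch -> embedding
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, emb_dim))
        self.head = nn.Linear(2 * emb_dim, 1)  # fused features -> logit

    def forward(self, face_clip, audio_feat):
        # face_clip: (B, 3, T, H, W); audio_feat: (B, 1, mels, frames)
        v = self.visual(face_clip)
        a = self.audio(audio_feat)
        return self.head(torch.cat([v, a], dim=1)).squeeze(-1)

model = AVActiveSpeakerNet()
logit = model(torch.randn(2, 3, 8, 64, 64), torch.randn(2, 1, 64, 100))
```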
Most real-world audio recordings contain several types of audio events. In this paper, we develop a technique for detecting signature audio events that is based on identifying patterns of occurrences of automatically learned atomic units of sound, which we call Acoustic Unit Descriptors (AUDs). Experiments show that the methodology works well for detecting individual events as well as their boundaries in complex recordings.
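One plausible reading of "patterns of occurrences" is a sliding-window classifier over the decoded AUD symbol stream: transcribe the audio into AUD indices, featurize each window as a normalized occurrence histogram, and flag windows a per-event classifier scores highly. The sketch below follows that assumption; the function names, window parameters, and the scikit-learn-style `event_model` are all hypothetical.

```python
from collections import Counter

def aud_histogram(aud_sequence, vocab_size):
    """Bag-of-AUDs feature: normalized histogram of unit occurrences
    over one window of the decoded AUD symbol stream."""
    counts = Counter(aud_sequence)
    total = max(len(aud_sequence), 1)
    return [counts.get(u, 0) / total for u in range(vocab_size)]

def detect_event(aud_sequence, event_model, vocab_size,
                 win=50, hop=10, thresh=0.5):
    """Slide a window over the AUD stream and flag spans whose
    occurrence pattern matches a trained per-event classifier.
    `event_model` is any classifier with predict_proba (assumption)."""
    hits = []
    for start in range(0, max(len(aud_sequence) - win + 1, 1), hop):
        feat = aud_histogram(aud_sequence[start:start + win], vocab_size)
        if event_model.predict_proba([feat])[0][1] > thresh:
            hits.append((start, start + win))  # event span, in AUD units
    return hits
```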
Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification, and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings, often tailored toward end applications. However, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches and understand their strengths and weaknesses. In this paper, we describe a new dataset, AVA-Speech, which we will release publicly, containing densely labeled speech activity in YouTube videos, with the goal of creating a shared, available dataset for this task. The labels annotate three different speech activity conditions: clean speech, speech co-occurring with music, and speech co-occurring with noise, which enable analysis of model performance in more challenging conditions involving overlapping noise. We report benchmark performance numbers on AVA-Speech using off-the-shelf, state-of-the-art audio and vision models that serve as a baseline to facilitate future research.
Current audio analysis techniques rely on fairly shallow analysis of audio content, using symbols or patterns extracted directly from the observed acoustics. We hypothesize that the observed acoustics actually map to semantics in a hierarchical manner, with higher levels of this hierarchy corresponding to increasingly higher-level semantics. In this paper, we present a model for deeper analysis of the observed acoustics that induces a probabilistic tree structure depending on estimated constituent identities and contexts. Audio characterization using this deeper structure outperforms standard shallow-feature-based characterizations.