Hough voting methods efficiently handle the high complexity of multiscale, category-level object detection in cluttered scenes. The primary weakness of this approach, however, is that mutually dependent local observations vote independently for intrinsically global object properties such as object scale. All votes are then added up to obtain object hypotheses, so the underlying assumption is that an object hypothesis is a sum of independent part votes. Popular representation schemes are, however, based on an overlapping sampling of semi-local image features with large spatial support (e.g., SIFT or geometric blur). Features are thus mutually dependent, and we incorporate these dependencies into probabilistic Hough voting by presenting an objective function that combines three intimately related problems: i) grouping of mutually dependent parts, ii) solving the correspondence problem jointly for dependent parts, and iii) finding concerted object hypotheses using extended groups rather than local observations alone. Experiments demonstrate that state-of-the-art Hough voting and even sliding-window detection are significantly improved by utilizing part dependencies and jointly optimizing groups, correspondences, and votes.
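To make the independence assumption concrete, the sketch below shows plain probabilistic Hough voting, in which every part vote is added independently to an (x, y, scale) accumulator and object hypotheses are read off as accumulator maxima. This is a minimal, hypothetical illustration of the baseline that the abstract argues against, not the proposed grouped-voting objective; the function names, grid resolution, and vote normalization are assumptions for illustration only.

```python
import numpy as np

def hough_vote(votes, weights, grid_shape=(64, 64, 8)):
    """Accumulate part votes into an (x, y, scale) Hough accumulator.

    votes   -- (N, 3) array of normalized (x, y, scale) votes in [0, 1)
    weights -- (N,) array of per-vote probabilities p(object | part)
    """
    acc = np.zeros(grid_shape)
    for (x, y, s), w in zip(votes, weights):
        ix = min(int(x * grid_shape[0]), grid_shape[0] - 1)
        iy = min(int(y * grid_shape[1]), grid_shape[1] - 1)
        iz = min(int(s * grid_shape[2]), grid_shape[2] - 1)
        # Standard Hough voting: each part votes on its own, so a hypothesis
        # is simply the sum of votes landing in its cell -- exactly the
        # independence assumption relaxed by grouping dependent parts.
        acc[ix, iy, iz] += w
    return acc

def top_hypotheses(acc, k=5):
    """Return the k accumulator cells with the largest summed vote mass."""
    flat = np.argsort(acc, axis=None)[::-1][:k]
    return [np.unravel_index(i, acc.shape) for i in flat]
```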
We propose the AViNet architecture for audiovisual saliency prediction. AViNet is a fully convolutional encoder-decoder architecture. The encoder combines visual features learned for action recognition with audio embeddings learned via an aural network designed to classify objects and scenes. The decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining hierarchical features. The overall architecture is conceptually simple, causal, and runs in real time (60 fps). AViNet outperforms the state of the art on ten datasets (seven audiovisual and three visual-only), while surpassing human performance on the CC, SIM, and AUC metrics for the AVE dataset. Visual features account for most of the saliency on existing datasets, with audio contributing only minor gains except in specific contexts such as social events. Our work therefore motivates the need to curate saliency datasets reflective of real life, where the visual and aural modalities complementarily drive saliency. Our code and pre-trained models are available at https://github.com/samyak0210/VideoSaliency.
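As a concrete illustration of the decoding pattern the abstract describes (trilinear upsampling plus 3D convolutions to combine hierarchical features), here is a minimal, hypothetical PyTorch sketch. It is not the released AViNet code, which is available at the repository above; all module names, channel counts, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySaliencyDecoder(nn.Module):
    """Toy decoder: fuse two hierarchical 3D feature maps into a saliency map."""

    def __init__(self, c_high=256, c_low=64):
        super().__init__()
        self.reduce = nn.Conv3d(c_high, c_low, kernel_size=1)       # match channels
        self.fuse = nn.Conv3d(c_low, c_low, kernel_size=3, padding=1)
        self.head = nn.Conv3d(c_low, 1, kernel_size=1)               # per-pixel saliency

    def forward(self, feat_high, feat_low):
        # feat_high: (B, c_high, T/2, H/16, W/16)  deep, coarse features
        # feat_low:  (B, c_low,  T,   H/4,  W/4)   shallow, fine features
        x = self.reduce(feat_high)
        # Trilinear interpolation upsamples across time and space at once.
        x = F.interpolate(x, size=feat_low.shape[2:], mode="trilinear",
                          align_corners=False)
        x = F.relu(self.fuse(x + feat_low))       # combine hierarchical features
        x = self.head(x)                          # (B, 1, T, H/4, W/4)
        return torch.sigmoid(x[:, 0, -1])         # saliency map for the last frame

# Example usage with random tensors of the illustrative shapes above.
dec = ToySaliencyDecoder()
hi = torch.randn(1, 256, 8, 14, 14)
lo = torch.randn(1, 64, 16, 56, 56)
print(dec(hi, lo).shape)  # torch.Size([1, 56, 56])
```

Taking only the last temporal slice keeps the sketch causal, in the spirit of the abstract's claim that the architecture is causal and real-time.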