Modern computer vision algorithms typically require expensive data acquisition and accurate manual labeling. In this work, we instead leverage the recent progress in computer graphics to generate fully labeled, dynamic, and photo-realistic proxy virtual worlds. We propose an efficient real-to-virtual world cloning method, and validate our approach by building and publicly releasing a new video dataset, called "Virtual KITTI", automatically labeled with accurate ground truth for object detection, tracking, scene and instance segmentation, depth, and optical flow. We provide quantitative experimental evidence suggesting that (i) modern deep learning algorithms pre-trained on real data behave similarly in real and virtual worlds, and (ii) pre-training on virtual data improves performance. As the gap between real and virtual worlds is small, virtual worlds enable measuring the impact of various weather and imaging conditions on recognition performance, all other things being equal. We show that these factors can drastically degrade otherwise high-performing deep models for tracking.
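As a rough illustration of finding (ii), the sketch below shows a generic two-stage protocol: pre-train a model on the automatically labeled virtual data, then fine-tune it on (typically scarcer) real data. The model, data loaders, loss function, and hyperparameters (virtual_epochs, real_epochs, lr) are placeholders, not the paper's actual setup.

```python
# Minimal sketch of a pre-train-on-virtual / fine-tune-on-real protocol.
# All names and hyperparameters here are illustrative assumptions.
import torch

def pretrain_then_finetune(model, virtual_loader, real_loader, loss_fn,
                           virtual_epochs=10, real_epochs=5, lr=1e-3):
    # First pass over virtual data, second (shorter) pass over real data.
    for loader, epochs in [(virtual_loader, virtual_epochs),
                           (real_loader, real_epochs)]:
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for images, targets in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(images), targets)
                loss.backward()
                optimizer.step()
    return model
```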
Saliency prediction typically relies on hand-crafted (multiscale) features that are combined in different ways to form a "master" saliency map, which encodes local image conspicuity. Recent improvements to the state of the art on standard benchmarks such as MIT1003 have been achieved mostly by incrementally adding more and more hand-tuned features (such as car or face detectors) to existing models [18,4,22,34]. In contrast, we here follow an entirely automatic data-driven approach that performs a large-scale search for optimal features. We identify those instances of a richly-parameterized bio-inspired model family (hierarchical neuromorphic networks) that successfully predict image saliency. Because of the high dimensionality of this parameter space, we use automated hyperparameter optimization to efficiently guide the search. The optimal blend of such multilayer features combined with a simple linear classifier achieves excellent performance on several image saliency benchmarks. Our models outperform the state of the art on MIT1003, on which features and classifiers are learned. Without additional training, these models generalize well to two other image saliency data sets, Toronto and NUSEF, despite their different image content. Finally, our algorithm scores best of all the 23 models evaluated to date on the MIT300 saliency challenge [16], which uses a hidden test set to facilitate an unbiased comparison.
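The final stage described above, a simple linear classifier over the selected multilayer features, can be sketched as follows. The function extract_multilayer_features is a hypothetical stand-in for the optimized hierarchical network found by the hyperparameter search, and the patch-based training setup is an assumption for illustration only.

```python
# Sketch: linear classifier over multilayer features separating fixated
# from non-fixated image patches; classifier scores over a dense patch
# grid then form the saliency map.
import numpy as np
from sklearn.svm import LinearSVC

def extract_multilayer_features(patches):
    # Placeholder for the blend of multilayer features selected by the
    # automated hyperparameter optimization.
    return patches.reshape(len(patches), -1)

def train_saliency_classifier(fixated_patches, non_fixated_patches):
    X = extract_multilayer_features(
        np.concatenate([fixated_patches, non_fixated_patches]))
    y = np.concatenate([np.ones(len(fixated_patches)),
                        np.zeros(len(non_fixated_patches))])
    return LinearSVC(C=1.0).fit(X, y)

def predict_saliency_map(clf, image_patches, grid_shape):
    scores = clf.decision_function(extract_multilayer_features(image_patches))
    return scores.reshape(grid_shape)
```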
Automatic multi-class object detection in remote sensing images in unconstrained scenarios is of high interest for several applications, including traffic monitoring and disaster management. The huge variation in object scale, orientation, and category, the complex backgrounds, and the differences between camera sensors pose great challenges for current algorithms. In this work, we propose a new method consisting of a novel joint image cascade and feature pyramid network with multi-size convolution kernels to extract multi-scale strong and weak semantic features. These features are fed into rotation-based region proposal and region of interest networks to produce object detections. Finally, rotational non-maximum suppression is applied to remove redundant detections. During training, we minimize joint horizontal and oriented bounding box loss functions, as well as a novel loss that enforces oriented boxes to be rectangular. Our method achieves 68.16% mAP on horizontal and 72.45% mAP on oriented bounding box detection tasks on the challenging DOTA dataset, outperforming all published methods by a large margin (+6% and +12% absolute improvement, respectively). Furthermore, it generalizes to two other datasets, NWPU VHR-10 and UCAS-AOD, and achieves competitive results with the baselines even when trained only on DOTA. Our method can be deployed in multi-class object detection applications regardless of the image and object scales and orientations, making it a great choice for unconstrained aerial and satellite imagery.
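The rotational non-maximum suppression step mentioned above can be illustrated with a simple greedy scheme over polygon IoU between oriented boxes; the exact overlap criterion and threshold used in the paper may differ. The function names are illustrative, and shapely is used here purely for convenience.

```python
# Hedged sketch of rotation-aware NMS: greedy suppression based on the
# polygon IoU between oriented boxes given as four corner points each.
import numpy as np
from shapely.geometry import Polygon

def rotated_iou(box_a, box_b):
    # Each box is a sequence of four (x, y) corner points.
    pa, pb = Polygon(box_a), Polygon(box_b)
    inter = pa.intersection(pb).area
    union = pa.union(pb).area
    return inter / union if union > 0 else 0.0

def rotated_nms(boxes, scores, iou_threshold=0.3):
    """Keep the highest-scoring oriented boxes, discarding overlaps."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        rest = [j for j in order[1:]
                if rotated_iou(boxes[i], boxes[j]) <= iou_threshold]
        order = np.array(rest, dtype=int)
    return keep
```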
Most saliency estimation methods aim to explicitly model low-level conspicuity cues such as edges or blobs and may additionally incorporate top-down cues using face or text detection. Data-driven methods for training saliency models using eye-fixation data are increasingly popular, particularly with the introduction of large-scale datasets and deep architectures. However, current methods in this latter paradigm use loss functions designed for classification or regression tasks whereas saliency estimation is evaluated on topographical maps. In this work, we introduce a new saliency map model which formulates a map as a generalized Bernoulli distribution. We then train a deep architecture to predict such maps using novel loss functions which pair the softmax activation function with measures designed to compute distances between probability distributions. We show in extensive experiments the effectiveness of such loss functions over standard ones on four public benchmark datasets, and demonstrate improved performance over state-of-the-art saliency methods.
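A minimal sketch of the loss construction described above: the predicted map is passed through a spatial softmax so that it forms a probability distribution, and is then compared to the normalized ground-truth fixation map with a distance between distributions. The Bhattacharyya distance is shown as one example; the paper evaluates several such measures, and this PyTorch snippet is not the authors' implementation.

```python
# Sketch: pair a spatial softmax with a probability-distribution distance
# (here, the Bhattacharyya distance) as a saliency training loss.
import torch
import torch.nn.functional as F

def bhattacharyya_loss(pred_logits, target_map, eps=1e-8):
    # pred_logits, target_map: tensors of shape (batch, H, W)
    b = pred_logits.size(0)
    pred = F.softmax(pred_logits.view(b, -1), dim=1)            # predicted distribution
    target = target_map.view(b, -1)
    target = target / (target.sum(dim=1, keepdim=True) + eps)   # normalize ground truth
    bc = torch.sqrt(pred * target + eps).sum(dim=1)             # Bhattacharyya coefficient
    return (-torch.log(bc + eps)).mean()                        # distance, averaged over batch
```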
Algorithms using "bag of features"-style video representations currently achieve state-of-the-art performance on action recognition tasks, such as the challenging Hollywood2 benchmark [1,2,3]. These algorithms are based on local spatiotemporal descriptors that can be extracted either sparsely (at interest points) or densely (on regular grids), with dense sampling typically leading to the best performance [1]. Here, we investigate the benefit of space-variant processing of inputs, inspired by attentional mechanisms in the human visual system. We employ saliency-mapping algorithms to find informative regions; descriptors corresponding to these regions are either used exclusively or given greater representational weight (additional codebook vectors). This approach is evaluated with three state-of-the-art action recognition algorithms [1,2,3] and several saliency algorithms. We also use saliency maps derived from human eye movements to probe the limits of the approach. Saliency-based pruning allows up to 70% of descriptors to be discarded while maintaining high performance on Hollywood2, and pruning 20-50% of descriptors (depending on the model) can even improve recognition. Further improvements can be obtained by combining representations learned separately on saliency-pruned and unpruned descriptor sets. Not surprisingly, using the human eye movement data gives the best mean Average Precision (mAP; 61.9%), providing an upper bound on what is possible with a high-quality saliency map. Even without such external data, the Dense Trajectories model [2] enhanced by automated saliency-based descriptor sampling achieves the best mAP (60.0%) reported on Hollywood2 to date.
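The saliency-based descriptor pruning can be sketched as keeping only those descriptors whose saliency value at their spatio-temporal location exceeds a percentile threshold. The array layout and the thresholding rule below are illustrative assumptions rather than the exact procedure used in the experiments.

```python
# Sketch: discard local descriptors falling in low-saliency regions
# before bag-of-features encoding.
import numpy as np

def prune_descriptors(descriptors, positions, saliency_maps, keep_fraction=0.5):
    """descriptors: (N, D) array; positions: (N, 3) int array of (t, y, x);
    saliency_maps: (T, H, W) per-frame saliency values."""
    values = saliency_maps[positions[:, 0], positions[:, 1], positions[:, 2]]
    threshold = np.percentile(values, 100 * (1.0 - keep_fraction))
    keep = values >= threshold
    return descriptors[keep], positions[keep]
```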
We deal with the analysis of eye movements made on natural movies in free-viewing conditions. Saccades are detected and used to label two classes of movie patches as attended and non-attended. Machine learning techniques are then used to determine how well the two classes can be separated, i.e. how predictable saccade targets are. Although very simple saliency measures are used and then averaged to obtain just one average value per scale, the two classes can be separated with an ROC score of around 0.7, which is higher than previously reported results. Moreover, predictability is analysed for different representations to obtain indirect evidence for the likelihood of a particular representation. It is shown that the predictability correlates with the local intrinsic dimension in a movie.
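A sketch of the separability analysis described above, assuming one feature vector per movie patch (e.g. one averaged saliency value per scale) for attended and non-attended patches; the logistic-regression classifier and the train/test protocol shown here are illustrative stand-ins.

```python
# Sketch: measure how well attended and non-attended patches can be
# separated, reporting the ROC score (area under the ROC curve).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def patch_separability(attended_feats, nonattended_feats, seed=0):
    X = np.concatenate([attended_feats, nonattended_feats])
    y = np.concatenate([np.ones(len(attended_feats)),
                        np.zeros(len(nonattended_feats))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.decision_function(X_te))
```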
Since visual attention-based computer vision applications have gained popularity, ever more complex, biologically-inspired models seem to be needed to predict salient locations (or interest points) in naturalistic scenes. In this paper, we explore how far one can go in predicting eye movements by using only basic signal processing, such as image representations derived from efficient coding principles, and machine learning. To this end, we gradually increase the complexity of a model from simple single-scale saliency maps computed on grayscale videos to spatio-temporal multiscale and multispectral representations. Using a large collection of eye movements on high-resolution videos, supervised learning techniques fine-tune the free parameters whose addition is inevitable with increasing complexity. The proposed model, although very simple, demonstrates significant improvement in predicting salient locations in naturalistic videos over four selected baseline models and two distinct data labelling scenarios.
Index Terms: Computational models of vision, video analysis, computer vision, spatio-temporal saliency, eye movement prediction, intrinsic dimension, visual attention, interest point detection.
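The general setup described above can be sketched as a small set of multiscale feature maps combined with a handful of free parameters that are fit to eye-movement data by supervised learning. The specific feature maps, the logistic-regression learner, and the function names below are assumptions for illustration.

```python
# Sketch: combine K simple feature maps with learned weights to form a
# saliency map; the weights are the "free parameters" fit to fixations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def combine_feature_maps(feature_maps, weights):
    # feature_maps: (K, H, W); weights: (K,) learned free parameters.
    return np.tensordot(weights, feature_maps, axes=1)

def fit_free_parameters(samples, labels):
    """samples: (N, K) feature-map values at fixated / control locations;
    labels: (N,) with 1 = fixated, 0 = control."""
    clf = LogisticRegression(max_iter=1000).fit(samples, labels)
    return clf.coef_.ravel()
```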