End-to-End Saliency Mapping via Probability Distribution Prediction

Jetley, Saumya; Murray, Naila; Vig, Eleonora

doi:10.1109/cvpr.2016.620

Cited by 133 publications

(106 citation statements)

References 35 publications

(57 reference statements)


“…As for loss function, most of the existing DCNN-based saliency models directly use the typical pixel-wise classification or regression loss functions whereas saliency prediction is evaluated on the whole saliency maps. In [27], Jetley et al. propose to use loss functions based on statistical distances with softmax normalization for training saliency models. Their results demonstrate the improvement by considering saliency maps as probability distributions.…”

Section: A Deep Learning-based Visual Saliency Prediction (mentioning)

confidence: 99%
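
The idea in the statement above, treating a saliency map as a probability distribution and training with a statistical distance, can be illustrated with a minimal NumPy sketch. The choice of KL divergence, the function names, the map size, and the random inputs are all assumptions made for illustration; the statement only says that the losses are statistical distances applied after softmax normalization.

```python
import numpy as np

def softmax_map(scores):
    """Turn a 2-D map of raw scores into a distribution that sums to 1."""
    z = scores - scores.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) summed over all pixels; eps guards against log(0)."""
    return float(np.sum(target * np.log((target + eps) / (pred + eps))))

rng = np.random.default_rng(0)
pred_scores = rng.normal(size=(4, 6))       # hypothetical raw network output
fixation_density = rng.random(size=(4, 6))  # hypothetical ground-truth map

p = softmax_map(pred_scores)                    # predicted distribution
g = fixation_density / fixation_density.sum()   # ground truth scaled to sum to 1 (assumption)
loss = kl_divergence(g, p)
print(f"KL(g || p) = {loss:.4f}")
```

Because the loss is computed over the whole normalized map, changing one pixel's score shifts the probability of every other pixel, which is what separates this formulation from a pixel-wise regression loss.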

A Dilated Inception Network for Visual Saliency Prediction

Yang, Lin, Jiang, et al. 2020

IEEE Trans. Multimedia

123

Recently, with the advent of deep convolutional neural networks (DCNN), the improvements in visual saliency prediction research are impressive. One possible direction to approach the next improvement is to fully characterize the multiscale saliency-influential factors with a computationally-friendly module in DCNN architectures. In this work, we proposed an end-to-end dilated inception network (DINet) for visual saliency prediction. It captures multi-scale contextual features effectively with very limited extra parameters. Instead of utilizing parallel standard convolutions with different kernel sizes as the existing inception module, our proposed dilated inception module (DIM) uses parallel dilated convolutions with different dilation rates which can significantly reduce the computation load while enriching the diversity of receptive fields in feature maps. Moreover, the performance of our saliency model is further improved by using a set of linear normalization-based probability distribution distance metrics as loss functions. As such, we can formulate saliency prediction as a probability distribution prediction task for global saliency inference instead of a typical pixel-wise regression problem. Experimental results on several challenging saliency benchmark datasets demonstrate that our DINet with proposed loss functions can achieve state-of-the-art performance with shorter inference time.
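
The abstract's claim that dilated convolutions widen the receptive field without adding parameters can be checked with a little arithmetic. The sketch below uses the standard relation k_eff = k + (k - 1)(d - 1) for a k x k kernel with dilation rate d; the dilation rates and channel counts are hypothetical values chosen for illustration, not figures taken from the paper.

```python
def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

def conv_weights(k, c_in, c_out):
    """Number of weights in a k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out

c_in, c_out = 64, 64                  # hypothetical channel counts
for d in (1, 4, 8):                   # hypothetical dilation rates
    span = effective_kernel(3, d)
    dilated = conv_weights(3, c_in, c_out)      # same cost for every rate
    dense = conv_weights(span, c_in, c_out)     # standard conv with equal extent
    print(f"dilation {d}: covers {span}x{span}, "
          f"{dilated} weights vs {dense} for a dense kernel")
```

A 3 x 3 kernel keeps the same weight count at every dilation rate, while a standard kernel covering the same extent grows quadratically with that extent.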

“…In order to convert the predicted saliency map and its corresponding ground-truth into probability distributions, a normalization method should be applied first. Here, we improve the existing method [27] by replacing their softmax normalization with a simple linear regularization.…”

Section: Loss Function (mentioning)

confidence: 99%
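
To make the contrast between the two normalizations concrete, the sketch below applies both to the same small map. The exact form of the linear normalization (subtract the minimum, then divide by the sum) is an assumption, as are the function names and the toy input; the statement above only says that softmax is replaced with a simple linear scheme.

```python
import numpy as np

def linear_normalize(x):
    """Shift so the minimum is 0, then scale so the map sums to 1 (assumed form)."""
    shifted = x - x.min()
    total = shifted.sum()
    if total == 0:                       # constant map: fall back to uniform
        return np.full(x.shape, 1.0 / x.size)
    return shifted / total

def softmax_normalize(x):
    """Exponentiate and scale so the map sums to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()

raw = np.array([[0.0, 1.0],
                [2.0, 5.0]])             # toy saliency scores

lin = linear_normalize(raw)
soft = softmax_normalize(raw)

# Ratio between the two largest entries under each scheme.
print("linear ratio :", lin[1, 1] / lin[1, 0])
print("softmax ratio:", soft[1, 1] / soft[1, 0])
```

In this toy case the linear scheme keeps the ratio between entries equal to the ratio of the raw scores, whereas softmax makes it depend exponentially on their difference, so the largest value dominates the distribution.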

“…Visual Saliency: The early CNN-based approaches for saliency were based on the adaptation of pretrained CNN models for visual recognition tasks [39,58]. Later, in [45] both shallow and deep CNN were trained end-to-end for saliency prediction while [28,29] trained the networks by optimizing common saliency evaluation metrics. In [44] the authors employed end-to-end Generative Adversarial Networks (GAN), while [62] has utilized multi-level saliency information from different layer through skip connections.…”

Section: Related Work (mentioning)

confidence: 99%

SUSiNet: See, Understand and Summarize It

Koutras, Maragos 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

In this work we propose a multi-task spatio-temporal network, called SUSiNet, that can jointly tackle the spatiotemporal problems of saliency estimation, action recognition and video summarization. Our approach employs a single network that is jointly end-to-end trained for all tasks with multiple and diverse datasets related to the exploring tasks. The proposed network uses a unified architecture that includes global and task specific layer and produces multiple output types, i.e., saliency maps or classification labels, by employing the same video input. Moreover, one additional contribution is that the proposed network can be deeply supervised through an attention module that is related to human attention as it is expressed by eye-tracking data. From the extensive evaluation, on seven different datasets, we have observed that the multi-task network performs as well as the state-of-the-art single-task methods (or in some cases better), while it requires less computational budget than having one independent network per each task.

“…Many CNN architectures have been proposed in the area of image classification [25,35,12,13,49], where a deep CNN model is able to achieve a higher accuracy for classification. However, CNN models used in state-of-the-art saliency applications are relatively shallow, such as VGGNet-16 [27,41,8,16,26,37,31] or ResNet-50 [29,9]. In the work of [14], the deeper model, GoogleNet, did not achieve better performance due to the limited training set.…”

Section: Introduction (mentioning)

confidence: 99%

EML-NET: An Expandable Multi-Layer NETwork for saliency prediction

Jia, Bruce 2020

Image and Vision Computing

119

Saliency prediction can benefit from training that involves scene understanding that may be tangential to the central task; this may include understanding places, spatial layout, objects or involve different datasets and their bias. One can combine models, but to do this in a sophisticated manner can be complex, and also result in unwieldy networks or produce competing objectives that are hard to balance. In this paper, we propose a scalable system to leverage multiple powerful deep CNN models to better extract visual features for saliency prediction. Our design differs from previous studies in that the whole system is trained in an almost end-to-end piece-wise fashion. The encoder and decoder components are separately trained to deal with complexity tied to the computational paradigm and required space. Furthermore, the encoder can contain more than one CNN model to extract features, and models can have different architectures or be pre-trained on different datasets. This parallel design yields a better computational paradigm overcoming limits to the variety of information or inference that can be combined at the encoder stage towards deeper networks and a more powerful encoding. Our network can be easily expanded almost without any additional cost, and other pre-trained CNN models can be incorporated availing a wider range of visual knowledge. We denote our expandable multi-layer network as EML-NET and our method achieves the state-of-the-art results on the public saliency benchmarks, SALICON, MIT300 and CAT2000.
