2018
DOI: 10.1007/978-3-030-01240-3_4

Learning to Zoom: A Saliency-Based Sampling Layer for Neural Networks

Abstract: We introduce a saliency-based distortion layer for convolutional neural networks that helps to improve the spatial sampling of input data for a given task. Our differentiable layer can be added as a preprocessing block to existing task networks and trained altogether in an end-to-end fashion. The effect of the layer is to efficiently estimate how to sample from the original data in order to boost task performance. For example, for an image classification task in which the original data might range in size up t…
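To make the abstract's idea concrete, here is a minimal sketch (in PyTorch, assumed rather than taken from the authors' released code) of how a saliency map can be turned into a differentiable, non-uniform sampling grid: each output location samples the input at the saliency-weighted average of coordinates within a Gaussian neighbourhood, so salient regions are effectively magnified. Because the warp ends in `grid_sample`, the whole block is differentiable and can be prepended to a task network and trained end-to-end.

```python
# Minimal sketch of a saliency-based sampling layer (illustrative, not the paper's exact code).
import torch
import torch.nn.functional as F


def gaussian_kernel(size: int = 9, sigma: float = 3.0) -> torch.Tensor:
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)


def saliency_sample(image: torch.Tensor, saliency: torch.Tensor,
                    kernel_size: int = 9, sigma: float = 3.0) -> torch.Tensor:
    """image: (B, C, H, W); saliency: (B, 1, H, W), non-negative."""
    b, _, h, w = saliency.shape
    k = gaussian_kernel(kernel_size, sigma).to(saliency.device)
    pad = kernel_size // 2

    # Normalised coordinate maps in [-1, 1], shaped (1, 1, H, W).
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    xs = xs.view(1, 1, h, w).to(saliency.device)
    ys = ys.view(1, 1, h, w).to(saliency.device)

    # Saliency-weighted average of coordinates in each neighbourhood:
    # high-saliency regions pull sampling locations towards themselves.
    denom = F.conv2d(saliency, k, padding=pad) + 1e-6
    grid_x = F.conv2d(saliency * xs, k, padding=pad) / denom
    grid_y = F.conv2d(saliency * ys, k, padding=pad) / denom

    grid = torch.stack([grid_x.squeeze(1), grid_y.squeeze(1)], dim=-1)  # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)


# Example: magnify a hypothetical salient region of a 1x3x64x64 image.
img = torch.rand(1, 3, 64, 64)
sal = torch.full((1, 1, 64, 64), 0.05)   # small uniform floor keeps the grid near identity
sal[..., 24:40, 24:40] = 1.0             # hypothetical salient region to magnify
warped = saliency_sample(img, sal)
print(warped.shape)                      # torch.Size([1, 3, 64, 64])
```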

Cited by 126 publications (110 citation statements) · References 28 publications
“…However, a 448 input increases the computational cost (i.e., FLOPs) by four times compared to a 224 input. SSN [22] obtains better results than DT-RAM [19], and our TASN can further obtain a 2.9% relative improvement. Such improvements mainly come from two aspects: 1) a better sampling mechanism considering spatial distortion (1.2%), and 2) a better fine-grained detail optimizing strategy (1.7%).…”
Section: Evaluation and Analysis on CUB-200-2011 (mentioning)
confidence: 81%
“…But without explicit guidance, it is hard to learn non-uniform sampling parameters for sophisticated tasks such as fine-grained recognition, so they finally learned two parts without non-uniform sampling. SSN [22] first proposed using saliency maps as guidance for non-uniform sampling and obtained significant improvements. Different from them, our attention sampler 1) conducts non-uniform sampling based on trilinear attention maps, and 2) decomposes attention maps into two dimensions to reduce spatial distortion effects.…”
Section: Related Work (mentioning)
confidence: 99%
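The axis-wise decomposition this citing work describes can be sketched as follows (a hypothetical illustration, not the TASN code): the 2D attention map is reduced to per-row and per-column marginals, and each image axis is resampled by the inverse CDF of its marginal, so the x and y distortions are independent.

```python
# Illustrative axis-wise non-uniform sampling from a 2D attention map.
import torch
import torch.nn.functional as F


def axis_coords(marginal: torch.Tensor, out_size: int) -> torch.Tensor:
    """marginal: (N,) non-negative weights along one axis -> (out_size,) coords in [-1, 1]."""
    pdf = marginal + 1e-6
    pdf = pdf / pdf.sum()
    cdf = torch.cumsum(pdf, dim=0)                       # monotone in (0, 1]
    targets = torch.linspace(0, 1, out_size)
    idx = torch.searchsorted(cdf, targets).clamp(max=len(cdf) - 1)
    return idx.float() / (len(cdf) - 1) * 2 - 1          # map indices to [-1, 1]


def attention_resample(image: torch.Tensor, attn: torch.Tensor, out_hw=(224, 224)) -> torch.Tensor:
    """image: (1, C, H, W); attn: (H, W) non-negative attention map."""
    xs = axis_coords(attn.sum(dim=0), out_hw[1])          # column marginal -> x coordinates
    ys = axis_coords(attn.sum(dim=1), out_hw[0])          # row marginal    -> y coordinates
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)   # (1, H_out, W_out, 2)
    return F.grid_sample(image, grid, align_corners=True)


img = torch.rand(1, 3, 448, 448)
attn = torch.rand(448, 448)             # hypothetical attention map
zoomed = attention_resample(img, attn)
print(zoomed.shape)                     # torch.Size([1, 3, 224, 224])
```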
“…Spatial Transformer Networks [28,42] learn spatial transformations (warping) of the CNN input. They explore different parameterizations for the spatial transformation, including affine, projective, and spline transforms [28] or specially designed saliency-based layers [42]. Their focus is to undo different data distortions or to "zoom in" on salient regions, while our approach focuses on efficient downsampling that retains as much information around semantic boundaries as possible.…”
Section: Prior Work (mentioning)
confidence: 99%
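For comparison with the saliency-based layer, the affine parameterization mentioned for Spatial Transformer Networks can be written with PyTorch's built-in grid utilities; the localisation network below is a placeholder, not any paper's exact architecture.

```python
# Minimal affine spatial-transformer sketch using affine_grid + grid_sample.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AffineSTN(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny localisation net predicting the 6 affine parameters.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, 7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 6),
        )
        # Initialise to the identity transform so training starts stable.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                    # (B, 2, 3) affine matrices
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)    # warped input for the task net


stn = AffineSTN()
out = stn(torch.rand(2, 3, 64, 64))
print(out.shape)                                              # torch.Size([2, 3, 64, 64])
```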
“…The work in [1] has demonstrated the advantages of foveated image processing with regard to computational efficiency (but did not address CNNs). In recent models of visual saliency using CNNs, images have been applied to networks using a foveal transform [2,8]. However, those works did not investigate image size reduction and frame-rate speed-up, which are of critical importance for embedded systems.…”
Section: arXiv:1908.09000v1 [cs.CV] 15 Aug 2019 (mentioning)
confidence: 99%
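A foveal transform of the kind referenced here can be sketched as a radially warped sampling grid that keeps the fixation point at high resolution while producing a small output; the power-law warp below is an assumed, illustrative choice rather than the transform used in the cited work.

```python
# Illustrative foveated downsampling: dense sampling at the centre, coarse in the periphery.
import torch
import torch.nn.functional as F


def foveate(image: torch.Tensor, out_size: int = 112, power: float = 2.0) -> torch.Tensor:
    """image: (B, C, H, W). power > 1 magnifies the centre and compresses the edges."""
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, out_size),
                            torch.linspace(-1, 1, out_size), indexing="ij")
    r = torch.sqrt(xs ** 2 + ys ** 2).clamp(min=1e-6)
    scale = (r / r.max()) ** (power - 1)                  # small near the centre, ~1 at the corners
    grid = torch.stack([xs * scale, ys * scale], dim=-1)  # (H_out, W_out, 2)
    grid = grid.unsqueeze(0).expand(image.size(0), -1, -1, -1)
    return F.grid_sample(image, grid, align_corners=True)


frame = torch.rand(1, 3, 448, 448)
small = foveate(frame)                  # 448x448 frame -> 112x112 foveated view
print(small.shape)                      # torch.Size([1, 3, 112, 112])
```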