We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes, which leads to decreased performance when the testing resolution differs from the training resolution. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, and thus combines local and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation with Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5× smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on the Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.
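The all-MLP decoder described above can be sketched in a few lines: per-stage linear projections unify the channel dimensions of the multiscale encoder features, the results are upsampled to the highest-resolution stage, concatenated, and fused before a final classification layer. The sketch below is a minimal numpy illustration under assumed shapes (the channel dimensions and nearest-neighbor upsampling are illustrative simplifications, not the paper's exact configuration).

```python
import numpy as np

def upsample_nn(x, scale):
    # Nearest-neighbor upsampling of a (C, H, W) feature map.
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def allmlp_decoder(features, proj_ws, w_fuse, w_cls):
    """Sketch of an all-MLP decoder over hierarchical encoder features.

    features: list of (C_i, H_i, W_i) maps, finest stage first.
    proj_ws:  per-stage (C_embed, C_i) projection weights.
    w_fuse:   (C_embed, C_embed * num_stages) fusion weights.
    w_cls:    (num_classes, C_embed) classification weights.
    """
    base_h = features[0].shape[1]
    unified = []
    for f, w in zip(features, proj_ws):
        # 1x1 "MLP" projection to a shared embedding dimension.
        proj = np.tensordot(w, f, axes=([1], [0]))
        # Upsample every stage to the finest stage's resolution.
        unified.append(upsample_nn(proj, base_h // f.shape[1]))
    # Concatenate along channels, fuse, and predict per-pixel logits.
    fused = np.concatenate(unified, axis=0)
    fused = np.tensordot(w_fuse, fused, axes=([1], [0]))
    return np.tensordot(w_cls, fused, axes=([1], [0]))
```

Because each stage carries a different effective receptive field, fusing all of them is what lets this purely linear decoder combine local and global context without heavy convolutional heads.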
Figure 1: Point clouds reconstructed by an existing unsupervised MVS network (Khot et al. [13]) and by our unsupervised and self-supervised methods. Best viewed on screen.
M. Feroci et al.
High-time-resolution X-ray observations of compact objects provide direct access to strong-field gravity, to the equation of state of ultradense matter, and to black hole masses and spins. A 10 m²-class instrument with good spectral resolution is required to exploit the relevant diagnostics and answer two of the fundamental questions of the European Space Agency (ESA) Cosmic Vision Theme "Matter under extreme conditions", namely: does matter orbiting close to the event horizon follow the predictions of general relativity? What is the equation of state of matter in neutron stars? The Large Observatory For X-ray Timing (LOFT), selected by ESA as one of the four Cosmic Vision M3 candidate missions to undergo an assessment phase, will revolutionise the study of collapsed objects in our galaxy and of the brightest supermassive black holes in active galactic nuclei. Thanks to an innovative design and the development of large-area monolithic silicon drift detectors, the Large Area Detector (LAD) on board LOFT will achieve an effective area of ∼12 m² (more than an order of magnitude larger than any spaceborne predecessor) in the 2-30 keV range (up to 50 keV in expanded mode), yet still fits on a conventional platform and a small/medium-class launcher. With this large area and a spectral resolution of <260 eV, LOFT will yield unprecedented information on strongly curved spacetimes and matter under extreme conditions of pressure and magnetic field strength.
Pixel-level annotations are expensive and time-consuming to obtain. Hence, weak supervision using only image tags could have a significant impact in semantic segmentation. Recently, CNN-based methods have proposed to fine-tune pre-trained networks using image tags. Without additional information, this leads to poor localization accuracy. This problem, however, was alleviated by making use of objectness priors to generate foreground/background masks. Unfortunately, these priors either require pixel-level annotations or bounding boxes for training, or still yield inaccurate object boundaries. Here, we propose a novel method to extract markedly more accurate masks from the pre-trained network itself, forgoing external objectness modules. This is accomplished using the activations of the higher-level convolutional layers, smoothed by a dense CRF. We demonstrate that our method, based on these masks and a weakly-supervised loss, outperforms the state-of-the-art tag-based weakly-supervised semantic segmentation techniques. Furthermore, we introduce a new form of inexpensive weak supervision yielding an additional accuracy boost.
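The mask-extraction idea above can be illustrated with a minimal sketch: average the higher-level activation maps into a single heat map, normalise it, and threshold to obtain a foreground/background mask. This is only a toy approximation under assumed inputs; the paper's actual method additionally smooths the mask with a dense CRF, which is omitted here, and the function name and threshold are hypothetical.

```python
import numpy as np

def fg_mask_from_activations(acts, threshold=0.5):
    """Toy sketch: derive a foreground mask from high-level CNN
    activations of shape (C, H, W).

    Steps: average over channels -> min-max normalise -> threshold.
    The dense-CRF smoothing used in the paper is omitted.
    """
    heat = acts.mean(axis=0)                      # (H, W) saliency heat map
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return (heat > threshold).astype(np.uint8)    # 1 = foreground
```

In practice the raw activation map is coarse and blobby, which is exactly why pairwise CRF smoothing (aligning the mask to image edges) is needed before the mask is usable as pixel-level supervision.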
This work presents an unsupervised learning based approach to the ubiquitous computer vision problem of image matching. We start from the insight that the problem of frame interpolation implicitly solves for inter-frame correspondences. This permits the application of analysis-by-synthesis: we first train and apply a Convolutional Neural Network for frame interpolation, then obtain correspondences by inverting the learned CNN. The key benefit of this strategy is that the CNN for frame interpolation can be trained in an unsupervised manner by exploiting the temporal coherency that is naturally present in real-world video sequences. The present model therefore learns image matching by simply "watching videos". Besides promising broader applicability, the presented approach achieves performance surprisingly comparable to traditional empirically designed methods.