Transformers, which are popular for language modeling, have recently been explored for vision tasks, e.g., the Vision Transformer (ViT) for image classification. ViT splits each image into a fixed-length sequence of tokens and then applies multiple Transformer layers to model their global relations for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset (e.g., ImageNet). We find this is because: 1) the simple tokenization of input images fails to model the important local structure (e.g., edges, lines) among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness under fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which introduces 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token (Tokens-to-Token), such that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after extensive study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance to MobileNets when trained directly on ImageNet. For example, a T2T-ViT comparable in size to ResNet50 can achieve 80.7% top-1 accuracy on ImageNet.
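The layer-wise token aggregation described above can be sketched in a few lines of Python. This is a simplified, non-overlapping merge on an illustrative 4x4 grid with stride 2; the actual T2T-ViT uses an overlapping "soft split", so the function and its parameters should be read as assumptions for illustration only:

```python
def t2t_step(tokens, grid_h, grid_w, k=2, s=2):
    """Merge each k x k neighborhood of tokens (stride s) into one token
    by concatenating their feature vectors, so the token count shrinks
    while the token dimension grows (a simplified T2T-style step)."""
    out, out_h, out_w = [], 0, 0
    for i in range(0, grid_h - k + 1, s):
        out_h += 1
        row_w = 0
        for j in range(0, grid_w - k + 1, s):
            row_w += 1
            merged = []
            for di in range(k):
                for dj in range(k):
                    merged.extend(tokens[(i + di) * grid_w + (j + dj)])
            out.append(merged)
        out_w = row_w
    return out, out_h, out_w

# Illustrative 4x4 grid of 3-dim tokens -> 2x2 grid of 12-dim tokens
tokens = [[float(t)] * 3 for t in range(16)]
merged, h, w = t2t_step(tokens, 4, 4)
print(len(merged), len(merged[0]))  # prints: 4 12
```

Each step trades spatial extent for feature dimension, which is how local structure from surrounding tokens enters the merged token.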
Recently, significant improvement has been made in semantic object segmentation due to the development of deep convolutional neural networks (DCNNs). Training such a DCNN usually relies on a large number of images with pixel-level segmentation masks, and annotating these images is very costly in terms of both finance and human effort. In this paper, we propose a simple-to-complex (STC) framework in which only image-level annotations are utilized to learn DCNNs for semantic segmentation. Specifically, we first train an initial segmentation network, called Initial-DCNN, with the saliency maps of simple images (i.e., those with a single category of major object(s) and a clean background). These saliency maps can be obtained automatically by existing bottom-up salient-object detection techniques, where no supervision information is needed. Then, a better network, called Enhanced-DCNN, is learned with supervision from the segmentation masks of simple images predicted by the Initial-DCNN, as well as the image-level annotations. Finally, more pixel-level segmentation masks of complex images (two or more categories of objects with cluttered backgrounds), which are inferred using the Enhanced-DCNN and image-level annotations, are utilized as the supervision information to learn the Powerful-DCNN for semantic segmentation. Our method utilizes 40K simple images from Flickr.com and 10K complex images from PASCAL VOC to boost the segmentation network step by step. Extensive experimental results on the PASCAL VOC 2012 segmentation benchmark demonstrate the superiority of the proposed STC framework compared with other state-of-the-art methods.
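The three training stages can be summarized schematically. Here `stc_pipeline`, `train_dcnn`, and the supervision tags are hypothetical stand-ins chosen to mirror the abstract's description, not the authors' code:

```python
def stc_pipeline(simple_images, complex_images, train_dcnn):
    """Run the three STC stages in order, each with stronger supervision.
    train_dcnn(images, supervision=...) is a caller-supplied stand-in."""
    # Stage 1: saliency maps of simple images supervise Initial-DCNN
    initial = train_dcnn(simple_images, supervision="saliency_maps")
    # Stage 2: Initial-DCNN masks + image-level labels -> Enhanced-DCNN
    enhanced = train_dcnn(
        simple_images, supervision=("masks_from", initial, "image_labels"))
    # Stage 3: Enhanced-DCNN masks on complex images -> Powerful-DCNN
    return train_dcnn(
        complex_images, supervision=("masks_from", enhanced, "image_labels"))
```

The point of the schedule is that each stage's supervision is produced by the previous stage, so pixel-level masks are never annotated by hand.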
In natural images, information is conveyed at different frequencies: higher frequencies are usually encoded with fine details and lower frequencies with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies, and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution, reducing both memory and computation cost. Unlike existing multi-scale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement for (vanilla) convolutions without any adjustments to the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy, such as group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost. An OctConv-equipped ResNet-152 can achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.
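The memory argument can be made concrete with simple arithmetic: if a fraction alpha of the channels is kept at half resolution (one quarter of the pixels), activation memory shrinks accordingly. The function and the alpha = 0.5 setting below are an illustrative back-of-the-envelope sketch, not a figure from the paper:

```python
def octconv_activation_memory_ratio(alpha):
    """Relative activation memory vs. a vanilla feature map: the
    low-frequency fraction alpha lives at half spatial resolution,
    i.e., 1/4 of the pixels per channel."""
    return (1.0 - alpha) + alpha / 4.0

print(octconv_activation_memory_ratio(0.5))  # prints: 0.625
```

Storing half the channels at half resolution already cuts activation memory by 37.5% before any change to the convolution itself.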
Globally modeling and reasoning over relations between regions can be beneficial for many computer vision tasks on both images and videos. Convolutional Neural Networks (CNNs) excel at modeling local relations via convolution operations, but they are typically inefficient at capturing global relations between distant regions and require stacking multiple convolution layers. In this work, we propose a new approach for reasoning globally in which a set of features is globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be computed efficiently. After reasoning, relation-aware features are distributed back to the original coordinate space for downstream tasks. We further present a highly efficient instantiation of the proposed approach, the Global Reasoning unit (GloRe unit), which implements the coordinate-interaction space mapping by weighted global pooling and weighted broadcasting, and the relational reasoning via graph convolution on a small graph in interaction space. The proposed GloRe unit is lightweight, end-to-end trainable, and can be easily plugged into existing CNNs for a wide range of tasks. Extensive experiments show that our GloRe unit can consistently boost the performance of state-of-the-art backbone architectures, including ResNet [15,16], ResNeXt [33], SE-Net [18] and DPN [9], for both 2D and 3D CNNs, on image classification, semantic segmentation, and video action recognition tasks.
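The pooling, reasoning, and broadcasting cycle can be sketched with plain matrix products on toy data. All weights below are illustrative assumptions rather than learned parameters, and the single `adj` multiply stands in for a full graph convolution:

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt]
            for row in A]

def glore_unit(X, proj, adj):
    """X: L x C features over L locations; proj: N x L pooling weights;
    adj: N x N node mixing. Returns L x C relation-aware features."""
    nodes = matmul(proj, X)        # weighted global pooling -> N x C
    reasoned = matmul(adj, nodes)  # reasoning on the small node graph
    projT = [list(r) for r in zip(*proj)]
    return matmul(projT, reasoned)  # weighted broadcasting -> L x C

# 4 locations with 2-dim features, pooled into 2 graph nodes
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
proj = [[0.5, 0.5, 0.0, 0.0],   # node 1 pools locations 1-2
        [0.0, 0.0, 0.5, 0.5]]   # node 2 pools locations 3-4
adj = [[1.0, 0.5], [0.5, 1.0]]  # fully connected 2-node graph
Y = glore_unit(X, proj, adj)
print(len(Y), len(Y[0]))  # prints: 4 2
```

Because reasoning happens among only N nodes instead of all L locations, the cost of modeling global relations no longer grows with the spatial size.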
The spin-orbital interaction in heavy nonmagnetic metal/ferromagnetic metal bilayer systems has attracted great attention and exhibited promising potential for magnetic logic devices, in which the magnetization direction is controlled by passing an electric current. It has been found that the spin-orbital interaction induces both an effective field and a torque on the magnetization, which have been attributed to two different origins: the Rashba effect and the spin Hall effect. Distinguishing the two mechanisms requires quantitative analysis. Here we show sensitive spin-orbital effective field measurements for ferromagnetic layers up to 10 nm thick and find that the effective field diminishes rapidly as the ferromagnetic layer thickness increases. We further show that this effective field persists even with the insertion of a copper spacer. The nonlocal measurement suggests that the spin-orbital effective field does not rely on the heavy normal metal/ferromagnetic metal interface.
In this paper, we aim to reduce the computational cost of spatio-temporal deep neural networks, making them run as fast as their 2D counterparts while preserving state-of-the-art accuracy on video recognition benchmarks. To this end, we present the novel Multi-Fiber architecture, which slices a complex neural network into an ensemble of lightweight networks, or fibers, that run through the network. To facilitate information flow between fibers, we further incorporate multiplexer modules, arriving at an architecture that reduces the computational cost of 3D networks by an order of magnitude while increasing recognition performance at the same time. Extensive experimental results show that our Multi-Fiber architecture significantly boosts the efficiency of existing convolutional networks for both image and video recognition tasks, achieving state-of-the-art performance on the UCF-101, HMDB-51, and Kinetics datasets. Our proposed model requires over 9× and 13× fewer computations than the I3D [1] and R(2+1)D [2] models, respectively, while providing higher accuracy.
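The core efficiency argument behind slicing a network into fibers can be illustrated with a quick count of channel connections; the channel and fiber counts below are illustrative, not the paper's exact architecture:

```python
def dense_cost(channels):
    """A layer mixing all C channels costs ~C^2 multiply-adds
    per spatial position."""
    return channels ** 2

def fiber_cost(channels, fibers):
    """g independent fibers of C/g channels each cost g * (C/g)^2,
    i.e., a factor of g fewer multiply-adds."""
    per_fiber = channels // fibers
    return fibers * per_fiber ** 2

c, g = 256, 16
print(dense_cost(c) // fiber_cost(c, g))  # prints: 16
```

Multiplexer modules then restore cross-fiber information flow at a small extra cost, which is the trade-off the architecture exploits.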
The depletion of traditional energy resources, as well as the desire to reduce the high CO2 emissions associated with their use, has led to significant interest in developing sustainable and clean energy products, [1-4] such as electricity produced from wind- or solar-based technologies. Because of the intermittent availability of these resources, the realization of their full potential will also require the development of new and advanced energy-storage and delivery systems. Supercapacitors, as a new class of energy-storage devices, are now attracting intensive attention [2] because of their ability to store energy comparable to certain types of batteries, but with the advantage of delivering the stored energy much more rapidly than batteries. [3] This property makes supercapacitors ideal for augmenting traditional batteries in many different applications. However, to become primary devices for power supply, supercapacitors must be developed further to improve their ability to deliver high energy and high power simultaneously. [5] To realize this objective, nanostructured electrodes have been developed from a variety of different functional materials. [6-10] Despite significant progress, however, most of the processes for the fabrication of electrodes are either too delicate, [11-14] which makes them less viable for large-scale industrial applications, or require additives, [15,16] which deteriorate the performance of the electrodes. In addition, previously reported electrode materials with the desirable specific capacitance typically show high resistances, [9,13,17] which not only restrict the power performance but also prevent the use of thick electrodes. Based on these considerations, the goal of the present work was to build an advanced supercapacitor electrode using a simple and scalable fabrication technique and to optimize the electrode performance using a controlled functional material and a well-defined electrode network with minimum resistivity.
First, Ni nanoparticles were synthesized using a modified polyol process. [18] After simple mechanical compaction of the as-prepared (AP) nanoparticles and a subsequent low-temperature annealing process, monolithic, mechanically robust, stable, and low-resistivity NiO/Ni nanoporous composite electrodes were obtained with both maximized energy and power densities. The structure of the AP Ni particles was characterized by X-ray diffraction (XRD; Figure 1a) and electron diffraction (ED; Figure 1b). The particle size estimated by the Scherrer method was 4.4 nm, and several particles formed larger aggregates with diameters smaller than 20 nm (Figure 1b); these findings are in agreement with the measured Brunauer-Emmett-Teller (BET) surface area of 40 m² g⁻¹. The AP particles were then mechanically compacted into monolithic pellets and used as prototype electrodes. These pellets are stable and easy to handle, and required neither additives nor a supporting substrate. Scanning electron microscopy images obtained of both the surface and cross-section...
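For readers unfamiliar with the Scherrer estimate mentioned above, a minimal sketch follows; the radiation wavelength, peak position, and peak width are assumed example values, not the measured data from this work:

```python
import math

def scherrer_size_nm(wavelength_nm, fwhm_rad, theta_rad, k=0.9):
    """Crystallite size D = K * lambda / (beta * cos(theta)), where beta
    is the diffraction peak's full width at half maximum in radians and
    theta is the Bragg angle (half the 2-theta peak position)."""
    return k * wavelength_nm / (fwhm_rad * math.cos(theta_rad))

# Cu K-alpha radiation (0.15406 nm); assumed peak at 2-theta = 44.5 deg
# with a 2 deg FWHM, chosen purely for illustration
theta = math.radians(44.5 / 2)
beta = math.radians(2.0)
print(round(scherrer_size_nm(0.15406, beta, theta), 1))  # prints: 4.3
```

With these assumed inputs the estimate lands in the same few-nanometer range as the 4.4 nm value reported above, illustrating how broad XRD peaks translate into small crystallite sizes.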