Vision Transformers have shown great potential in computer vision tasks. Most recent works have focused on elaborating the spatial token mixer for performance gains. However, we observe that a well-designed general architecture can significantly improve the performance of the entire backbone, regardless of which spatial token mixer is equipped. In this paper, we propose UniNeXt, an improved general architecture for the vision backbone. To verify its effectiveness, we instantiate the spatial token mixer with various typical and modern designs, including both convolution and attention modules. Compared with the architectures in which they were first proposed, our UniNeXt architecture can steadily boost the performance of all the spatial token mixers and narrow the performance gap among them. Surprisingly, our UniNeXt equipped with naive local window attention even outperforms the previous state-of-the-art. Interestingly, the ranking of these spatial token mixers also changes under UniNeXt, suggesting that an excellent spatial token mixer may be stifled by a suboptimal general architecture, which further underscores the importance of studying the general architecture of vision backbones. All models and code will be made publicly available.
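The abstract does not specify UniNeXt's internals, but its central claim, a general architecture with a pluggable spatial token mixer, can be illustrated with a minimal sketch. The pre-norm block layout, the MLP ratio, and the ConvMixer below are common-convention assumptions, not the paper's actual design.

# A minimal sketch (PyTorch), assuming a standard pre-norm transformer block;
# UniNeXt's actual layout may differ.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int, mixer: nn.Module, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer  # any spatial token mixer: attention, conv, ...
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        x = x + self.mixer(self.norm1(x))  # spatial token mixing
        x = x + self.mlp(self.norm2(x))    # channel mixing
        return x

class ConvMixer(nn.Module):
    """Hypothetical depthwise-convolution token mixer for a 14x14 token grid."""
    def __init__(self, dim: int, k: int = 7, hw=(14, 14)):
        super().__init__()
        self.hw = hw
        self.conv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)

    def forward(self, x):
        b, n, c = x.shape
        h, w = self.hw
        x = x.transpose(1, 2).reshape(b, c, h, w)  # tokens -> feature map
        x = self.conv(x)
        return x.flatten(2).transpose(1, 2)        # feature map -> tokens

blk = Block(dim=64, mixer=ConvMixer(64))
out = blk(torch.randn(2, 14 * 14, 64))  # -> (2, 196, 64)

Because the mixer is an injected module, swapping local window attention for a depthwise convolution requires no change to the surrounding block, which is exactly the kind of controlled comparison the abstract describes.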
In this paper, we propose a novel image inpainting framework that takes advantage of the holistic and structural information of the damaged input image. Unlike existing models that complete damaged pictures using only the holistic features of the input, our method adopts patch-based generative adversarial networks (Patch-GANs) equipped with multi-scale discriminators and an edge-processing function to extract holistic and structural features and restore the damaged images. After pre-training the Patch-GANs, the proposed network encourages the generator to find the best encoding of the damaged input images in the latent space using a combination of a reconstruction loss, an edge loss, and global and local guidance losses. The reconstruction and global guidance losses ensure the pixel reliability of the generated images, while the remaining losses guarantee content consistency between the local and global parts. Qualitative and quantitative experiments on multiple public datasets show that our approach produces more realistic images than several existing methods, demonstrating its effectiveness and superiority. INDEX TERMS Image inpainting, Patch-GANs, multi-scale discriminators.
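As a rough illustration of the combined objective the abstract lists (reconstruction, edge, and global/local guidance losses), here is a minimal sketch in PyTorch. The Sobel-based edge loss, the L1 distances, the loss weights, and the use of discriminator features for the guidance terms are all illustrative assumptions; the paper's exact formulations may differ.

import torch
import torch.nn.functional as F

def edge_map(img):
    """Approximate edge magnitude with Sobel filters, per input channel."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    k = torch.stack([kx, kx.t()]).unsqueeze(1)  # (2, 1, 3, 3)
    c = img.shape[1]
    k = k.repeat(c, 1, 1, 1).to(img)            # one filter pair per channel
    g = F.conv2d(img, k, padding=1, groups=c)
    return g.abs().sum(dim=1, keepdim=True)

def inpainting_loss(pred, target, d_global, d_local, local_box,
                    w=(1.0, 0.1, 0.5, 0.5)):
    """pred/target: (B, 3, H, W); local_box: (y0, y1, x0, x1) of the hole;
    d_global/d_local: feature extractors standing in for the pre-trained
    Patch-GAN discriminators (an assumption, not the paper's exact use)."""
    y0, y1, x0, x1 = local_box
    l_rec = F.l1_loss(pred, target)                        # pixel reliability
    l_edge = F.l1_loss(edge_map(pred), edge_map(target))   # structure
    l_glob = F.l1_loss(d_global(pred), d_global(target))   # global guidance
    l_loc = F.l1_loss(d_local(pred[..., y0:y1, x0:x1]),    # local guidance
                      d_local(target[..., y0:y1, x0:x1]))
    return w[0] * l_rec + w[1] * l_edge + w[2] * l_glob + w[3] * l_loc

# usage with placeholder conv layers as the "discriminators":
d_g = torch.nn.Conv2d(3, 8, 3, padding=1)
d_l = torch.nn.Conv2d(3, 8, 3, padding=1)
loss = inpainting_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                       d_g, d_l, local_box=(16, 48, 16, 48))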
Neural Architecture Search (NAS), which aims to have machines automatically design network architectures, is expected to bring about a new revolution in machine learning. Despite these high expectations, the effectiveness and efficiency of existing NAS solutions are unclear, with some recent works going so far as to suggest that many existing NAS solutions are no better than random architecture selection. The inefficiency of NAS solutions may be attributed to inaccurate architecture evaluation. Specifically, to speed up NAS, recent works have proposed concurrently training different candidate architectures in a large search space with shared network parameters; the resulting undertraining, however, leads to incorrect architecture ratings and furthers the ineffectiveness of NAS. In this work, we propose to modularize the large search space of NAS into blocks to ensure that the potential candidate architectures are fully trained; this reduces the representation shift caused by the shared parameters and leads to the correct rating of the candidates. Thanks to the block-wise search, we can also evaluate all of the candidate architectures within a block. Moreover, we find that the knowledge of a network model lies not only in the network parameters but also in the network architecture. Therefore, we propose to distill the neural architecture (DNA) knowledge from a teacher model as the supervision to guide our block-wise architecture search, which significantly improves the effectiveness of NAS. Remarkably, the capacity of our searched architecture exceeds that of the teacher model, demonstrating the practicability and scalability of our method. Finally, our method achieves a state-of-the-art 78.4% top-1 accuracy on ImageNet in a mobile setting, which is about a 2.1% gain over EfficientNet-B0. All of our searched models along with the evaluation code are available at https://github.com/jiefengpeng/DNA. [Figure caption fragment: the search space is modularized into blocks conceptualized as analogous to the ventral visual blocks V1, V2, V4, and IT [28]; candidate architectures (denoted by different shapes and paths) are then searched block-wise, guided by the architecture knowledge distilled from a teacher model.]
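The block-wise distillation idea can be sketched as follows: each candidate for a block is trained to reproduce the frozen teacher block's output features and is then rated by its feature-matching error, so all candidates within a block can be evaluated without parameter sharing across the whole network. The MSE loss, optimizer, and training budget below are illustrative choices, not the paper's settings.

import itertools
import torch
import torch.nn as nn

def rate_candidates(candidates, teacher_block, loader, steps=100):
    """Train each candidate to mimic the frozen teacher block, then score it
    by feature-matching error on a held-out batch (lower is better)."""
    scores = []
    teacher_block.eval()
    for cand in candidates:
        opt = torch.optim.Adam(cand.parameters(), lr=1e-3)
        batches = itertools.cycle(loader)
        for _ in range(steps):                      # fully train this candidate
            x = next(batches)
            with torch.no_grad():
                target = teacher_block(x)           # distilled supervision
            loss = nn.functional.mse_loss(cand(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():                       # rate on a fresh batch
            x = next(batches)
            scores.append(nn.functional.mse_loss(cand(x), teacher_block(x)).item())
    return scores

# rating two hypothetical candidates for one block against a frozen teacher:
teacher = nn.Conv2d(8, 8, 3, padding=1)
cands = [nn.Conv2d(8, 8, 1), nn.Conv2d(8, 8, 5, padding=2)]
data = [torch.randn(4, 8, 16, 16) for _ in range(8)]
print(rate_candidates(cands, teacher, data, steps=20))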
Recent work on unsupervised image-to-image translation adversarially learns a mapping between different domains but cannot distinguish the foreground from the background. Existing image-to-image translation methods mainly transfer the whole image across the source and target domains. However, not all regions of an image should be transferred, because forcefully transferring unnecessary parts leads to unrealistic translations. In this paper, we present a positional attention bi-flow generative network that focuses the translation model on a region or object of interest in the image. We assume that the image representation can be decomposed into three parts: image-content, image-style, and image-position features. We apply an encoder to extract these features and a bi-flow generator with an attention module to perform the translation task in an end-to-end manner. To realize object-level translation, we adopt the image-position features to label the common region of interest between the source and target domains. We analyze the proposed framework and provide qualitative and quantitative comparisons. Extensive experiments validate that our proposed model accomplishes object-level translation and obtains results that compare favorably with other state-of-the-art approaches. INDEX TERMS Image-to-image translation, attention mechanism, GANs.
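A minimal sketch of the decomposition this abstract describes: separate branches extract content, style, and position features, and the position branch acts as a soft spatial attention mask so that only the region of interest is translated while the background passes through unchanged. All module internals here are placeholder assumptions; the actual bi-flow generator is far more elaborate.

import torch
import torch.nn as nn

class AttnTranslator(nn.Module):
    """Toy decomposition into content, style, and position features; the
    position branch gates which pixels are translated."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.enc_content = nn.Conv2d(3, ch, 3, padding=1)
        self.enc_style = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                       nn.Conv2d(3, ch, 1))
        self.enc_position = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1),
                                          nn.Sigmoid())  # soft region mask
        self.gen = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):
        content = self.enc_content(x)        # (B, ch, H, W)
        style = self.enc_style(x)            # (B, ch, 1, 1) global style code
        mask = self.enc_position(x)          # (B, 1, H, W), values in [0, 1]
        translated = self.gen(content * style)
        # translate only the attended region; keep the background unchanged
        return mask * translated + (1 - mask) * x

y = AttnTranslator()(torch.randn(2, 3, 64, 64))  # -> (2, 3, 64, 64)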
Image-to-image translation is a class of vision and graphics problems whose goal is to learn the mapping between input images and output images. However, due to unstable training and limited training samples, many existing GAN-based works have difficulty producing photo-realistic images. Herein, dual-directional generative adversarial networks, consisting of four adversarial networks, are proposed to produce images of high perceptual quality. In this framework, a self-reconstruction strategy is used to construct auxiliary sub-networks, which impose more effective constraints on the encoder-generator pairs. With this idea, the model increases the utilization of paired data within the same dataset and obtains well-trained encoder-generator pairs with the help of the proposed cross-network skip connections. Moreover, the proposed framework not only produces realistic images but also addresses the problem where conditional GANs produce sharp images containing many small, hallucinated objects. Trained on multiple supervised datasets, the model is shown to achieve compelling results by latently learning a common feature representation. Qualitative and quantitative comparisons against other methods demonstrate the effectiveness and superiority of the method.
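The training signal described above can be sketched as two translation paths (A to B and B to A) plus two self-reconstruction paths (A to A and B to B) that reuse the same encoders, so each encoder-generator pair is constrained twice by the same paired data. The L1 losses below are an assumption, and the four adversarial terms (one per network) and cross-network skip connections are omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

def dual_direction_loss(enc_a, enc_b, gen_a, gen_b, a, b):
    """enc_*: image -> latent; gen_*: latent -> image in the named domain."""
    za, zb = enc_a(a), enc_b(b)
    l_trans = F.l1_loss(gen_b(za), b) + F.l1_loss(gen_a(zb), a)  # translation
    l_self = F.l1_loss(gen_a(za), a) + F.l1_loss(gen_b(zb), b)   # self-recon
    return l_trans + l_self

# usage with placeholder encoders/generators:
enc_a, enc_b = nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(3, 8, 3, padding=1)
gen_a, gen_b = nn.Conv2d(8, 3, 3, padding=1), nn.Conv2d(8, 3, 3, padding=1)
a, b = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
print(dual_direction_loss(enc_a, enc_b, gen_a, gen_b, a, b).item())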