INTERN: A New Learning Paradigm Towards General Vision

Shao, Jing; Chen, Siyu; Li, Yangguang; Wang, Kun; Yin, Zhenfei; He, Yakun; Teng, Jianing; Sun, Qinglin; Gao, Mengya; Liu, Jihao; Huang, Gengshi; Song, Guanglu; Wu, Yichao; Huang, Yuming; Liu, Fenggang; Peng, Huan; Qin, Shuo; Wang, Chengyu; Wang, Yujie; He, Conghui; Ding, Liang; Liu, Yu; Yu, Fei; Yan, Junjie; Lin, Dahua; Wang, Xiaogang; Qiao, Yu

doi:10.48550/arxiv.2111.08687

Cited by 7 publications

(13 citation statements)

References 37 publications


“…Recent studies have demonstrated that TOV model trained by contrastive self-supervised learning with mass unlabeled nature images has impressive generalizability, which perform comparably well or even better than supervised learning methods across various computer vision tasks [11,16,21]. However, we experimentally find that directly using this pipeline to train TOV model for RSIU cannot obtain desired results.…”

Section: Training TOV Model for RSIU Based on a Human-like SSL Mechanism (mentioning)

confidence: 73%

“…Unlike the machine vision that is "taught" by labeled data, human-like vision is achieved by holistic and joint models that can simultaneously solve real-world problems by unsupervised way [11]. The key reason is that human visual recognition system is not limited to a specific task or specific dataset, and human language based labels are not the prerequisite for constructing the human visual system.…”

Section: Introduction (mentioning)

confidence: 99%


TOV: The Original Vision Model for Optical Remote Sensing Image Understanding via Self-supervised Learning

Tao, Qia, Zhang et al. 2022

Preprint


Are we on the right way for remote sensing image understanding (RSIU) by training models in a supervised, data-dependent and task-dependent way, instead of following human vision in a label-free and task-independent way? We argue that a more desirable RSIU model should be trained with intrinsic structure from data rather than extrinsic human labels to realize generalizability across a wide range of RSIU tasks. According to this hypothesis, we propose The Original Vision model (TOV) in the remote sensing field. Trained on massive unlabeled optical data along a human-like self-supervised learning (SSL) path that goes from general knowledge to specialized knowledge, the TOV model can be easily adapted to various RSIU tasks, including scene classification, object detection, and semantic segmentation, and outperforms the dominant ImageNet supervised pretraining method as well as two recently proposed SSL pretraining methods on the majority of 12 publicly available benchmarks. Moreover, we analyze the influence of two key factors on the performance of building the TOV model for RSIU: the data sampling method and the selection of learning paths during self-supervised optimization. We believe that a general model which is trained in a label-free and task-independent way may …


“…Recent progress has shown a great interest in general-purpose models [31,21,34,20,52,1] which can deal with a wide variety of input modalities and output tasks. Previous works [31,21] train models with a huge amount of image-text pairs by matching images to their captions.…”

Section: General-purpose Models (mentioning)

confidence: 99%

Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking

Chen, Li, Bai et al. 2022

Preprint


Exploiting a general-purpose neural architecture to replace hand-wired designs or inductive biases has recently drawn extensive interest. However, existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection, hindering the tracking development in a more general system. This paper presents a Simplified Tracking architecture (SimTrack) by leveraging a transformer backbone for joint feature extraction and interaction. Unlike existing Siamese trackers, we serialize the input images and concatenate them directly before the one-branch backbone. Feature interaction in the backbone helps to remove well-designed interaction modules and produce a more efficient and effective framework. To reduce the information loss from down-sampling in vision transformers, we further propose a foveal window strategy, providing more diverse input patches with acceptable computational costs. Our SimTrack improves the baseline with 2.5%/2.6% AUC gains on LaSOT/TNL2K and gets results competitive with other specialized tracking algorithms without bells and whistles.


“…Interestingly CLIP can be even used in text-guided image generation task (Style-CLIP [16]) and Embodied AI (EmbCLIP [8]). CLIP has also contributed to the development of general vision [20]. Witnessing CLIP's active community and wide applications, we propose the first work to benchmark CLIP.…”

Section: Related Work (mentioning)

confidence: 99%

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Cui, Zhao, Li et al. 2022

Preprint

Self Cite


Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging. This is because researchers do not choose consistent training recipes and even use different data, hampering the fair comparison between different methods. In this work, we propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants. We conduct a comprehensive analysis of three key factors: data, supervision, and model architecture. We find considerable intuitive or counter-intuitive insights: (1) Data quality has a significant impact on performance. (2) Certain supervision has different effects for Convolutional Networks (ConvNets) and Vision Transformers (ViT). Applying more proper supervision can effectively improve the performance of CLIP. (3) Curtailing the text encoder reduces the training cost but does not much affect the final performance. Moreover, we further combine DeCLIP [9] with FILIP [30], bringing us the strongest variant DeFILIP. The CLIP-benchmark will be released at https://github.com/Sense-GVT/DeCLIP for future CLIP research.
