2023
DOI: 10.1007/s11633-022-1369-5

VLP: A Survey on Vision-language Pre-training

Abstract: In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial work has shown that these models benefit downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-train…
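
To make the survey's subject concrete, the sketch below illustrates one representative vision-language pre-training objective: CLIP-style image-text contrastive learning, in which matched image/caption pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. This is a minimal sketch of a common technique covered by VLP surveys, not this paper's own method; the function name, embedding dimension, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits between every image and every caption.
    logits = image_emb @ text_emb.t() / temperature
    # Matched image/caption pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 4 image/caption pairs with 512-dim encoder outputs
# (dimensions are placeholders; real encoders would produce these).
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
print(clip_style_contrastive_loss(img, txt))
```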

Cited by 86 publications (20 citation statements). References 111 publications.
“…We will briefly review existing ideas and methods that are highly related to our work. For more details, please refer to previous survey papers [1], [2], [27], [28], [29].…”
Section: Related Work
Mentioning confidence: 99%
“…For instance, Li et al [61] shared advances on vision-language tasks, including VLM pretraining for various task-specific methods. Du et al [62] and Chen et al [63] reviewed VLM pre-training for vision-language tasks [57], [58], [60]. Xu et al [64] and Wang et al [65] shared recent progress of multi-modal learning on multi-modal tasks (e.g., language, vision and auditory modalities).…”
Section: Relevant Surveys
Mentioning confidence: 99%
“…Since 2014, the ascendancy of deep learning techniques has reverberated in cross-modal retrieval, harnessing the potency of deep neural networks to autonomously glean high-level feature representations from multi-modal data [5]. In recent years, a cascade of cross-modal retrieval approaches has been tailored to diverse open scenarios, harnessing the potential of vision-language pretraining models [6]. These strides have notably bolstered the precision, robustness, and scalability of cross-modal retrieval systems by infusing sophisticated learning models and training strategies.…”
Section: Introduction
Mentioning confidence: 99%