2022
DOI: 10.1145/3473140
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Abstract: Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training task to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structu…

Cited by 7 publications (4 citation statements) · References 44 publications
“…Deep learning (DL) [1], as one of the most popular machine learning methods driven by big data, has been widely studied and employed in various fields and different scenarios such as face detection [2], social networks [3,4], natural language processing [5,6], speech technology [7][8][9], detection of network anomalies [10,11], and multimodal learning [12][13][14].…”
Section: Introduction
Confidence: 99%
“…Recent strides in vision-language pre-training have exerted a profound impact on image captioning research [28][29][30]. Zhou et al [28] present a unified vision-language pre-training (VLP) model for image captioning, employing a Transformer network for both encoding and decoding, with pre-training on large image-text pairs.…”
Section: Vision-Language Pre-training Advancements
Confidence: 99%
“…This novel approach, leveraging textual augmentation, demonstrates improved performance in various vision-language tasks, notably in image captioning, by refining representation quality and model convergence. Li et al [30] introduce Uni-EDEN, a Universal Encoder-Decoder Network for vision-language tasks, focusing on multi-granular vision-language pre-training. This approach notably enhances multimodal reasoning and language modeling capabilities, advancing both perception and generation aspects in image captioning.…”
Section: Vision-Language Pre-training Advancements
Confidence: 99%
“…Owing to successful applications of pre-training methods in NLP [7,43] and CV [5,22], more and more researchers attempt to explore this "Pre-training & Fine-tuning" paradigm in the video-text field [25,33], which has achieved remarkable performance gain in various downstream video understanding tasks, such as video-text retrieval [10,38,53], video question answering [44,55,59], and video reasoning [6,15,42,54,57]. There are two mainstream paradigms in current video-text pre-training methods: the feature-level paradigm and the pixel-level one.…”
Section: Introduction
Confidence: 99%