Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3481540

Unifying Multimodal Transformer for Bi-directional Image and Text Generation

Abstract: We study the joint learning of image-to-text and text-to-image generations, which are naturally bi-directional tasks. Typical existing works design two separate task-specific models for each task, which impose expensive design efforts. In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks. We adopt Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate b…


Cited by 35 publications (21 citation statements) · References 47 publications
“…We use Fréchet Inception Distance (FID) [105] and Inception Score (IS) [106] to evaluate the quality of the images. Following the previous studies [50,74], we also finetune the OFA model for 50,000 steps with a batch size of 256. The learning rate is 3e-5 with label smoothing of 0.1, and the maximum text sequence length is set to 512.…”
Section: Referring Expression Comprehension (mentioning)
confidence: 99%
“…The model is first finetuned with cross-entropy and then with CLIPSIM optimization following [74,107]. In the first stage, we finetune the OFA model for about 50 epochs with a batch size of 512 and a learning rate of 1e-3.…”
Section: Natural Language Generation (mentioning)
confidence: 99%
“…Several studies have been conducted on bidirectional translation tasks. For the bidirectional text-to/from-image generation task, some studies apply pre-trained models trained with large-scale data [8]. For conversion between symbolic actions/states and texts in a 3D simulator environment [5], pre-training on large non-paired data has shown improved performance in zero-shot settings.…”
Section: A. Utilization of Pre-Trained Models in Translation Tasks (mentioning)
confidence: 99%
“…Among the various methods for tackling the two separate tasks, self-attention has been the mainstream approach, offering a unified framework that naturally encodes (and decodes) both text and image data [16–18, 25, 35, 39]. Some studies aimed to perform both generation tasks with a single model [5,12], where K-means clustering was used to discretize the image features, leading to worse image generation performance than image quantization. In this work, to investigate the effectiveness of translation equivariance in the quantized space, we used the DALL-E architecture to conduct both I → T and T → I generation tasks.…”
Section: Related Work (mentioning)
confidence: 99%