Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3481540

Unifying Multimodal Transformer for Bi-directional Image and Text Generation

Abstract: We study the joint learning of image-to-text and text-to-image generations, which are naturally bi-directional tasks. Typical existing works design two separate task-specific models for each task, which impose expensive design efforts. In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks. We adopt Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate b…


Cited by 35 publications (21 citation statements) · References 47 publications
“…We use Fréchet Inception Distance (FID) [105] and Inception Score (IS) [106] to evaluate the quality of the images. Following the previous studies [50,74], we also finetune the OFA model for 50,000 steps with a batch size of 256. The learning rate is 3e-5 with label smoothing of 0.1, and the maximum text sequence length is set to 512.…”
Section: Referring Expression Comprehension (mentioning)
confidence: 99%
“…The model is first finetuned with cross-entropy and then with CLIPSIM optimization following [74,107]. In the first stage, we finetune the OFA model for about 50 epochs with a batch size of 512 and a learning rate of 1e-3.…”
Section: Natural Language Generation (mentioning)
confidence: 99%
“…Several studies have been conducted on bidirectional translation tasks. For the bidirectional text-to/from-image generation task, some studies apply pre-trained models trained with large-scale data [8]. For conversion between symbolic actions/states and texts in a 3D simulator environment [5], pre-training on large non-paired data has shown improved performance in zero-shot settings.…”
Section: A. Utilization of Pre-Trained Models in Translation Tasks (mentioning)
confidence: 99%
“…Among the various methods for tackling the two separate tasks, self-attention has been the mainstream approach, offering a unified framework that naturally encodes (and decodes) both text and image data [16–18, 25, 35, 39]. Some studies aimed to perform both generation tasks with a single model [5,12], where K-means clustering was used to discretize the image features, leading to worse image generation performance than image quantization. In this work, to investigate the effectiveness of translation equivariance in the quantized space, we used the DALL-E architecture to conduct both I → T and T → I generation tasks.…”
Section: Related Work (mentioning)
confidence: 99%