2022
DOI: 10.48550/arxiv.2201.07520
Preprint

CM3: A Causal Masked Multimodal Model of the Internet

Abstract: We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of at their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative mode…
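
As a concrete illustration of the rearrangement the abstract describes, here is a minimal Python sketch: a small number of contiguous spans are cut out of the token string, replaced in place by sentinel tokens, and appended at the end so a left-to-right model generates them last. The function name, sentinel format, and span-selection policy are illustrative assumptions for a toy example, not the authors' implementation (CM3 masks long spans in large structured multi-modal documents).

```python
import random

def causal_mask_transform(tokens, n_spans=1, span_len=3, seed=None):
    """Move a few contiguous spans to the end of the sequence behind sentinels."""
    rng = random.Random(seed)
    tokens = list(tokens)
    tail = []
    for i in range(n_spans):
        if len(tokens) <= span_len:
            break
        start = rng.randrange(0, len(tokens) - span_len + 1)
        span = tokens[start:start + span_len]
        sentinel = f"<mask:{i}>"
        # Replace the span with a sentinel token at its original position ...
        tokens[start:start + span_len] = [sentinel]
        # ... and queue the span (prefixed by the same sentinel) for the end.
        tail += [sentinel] + span
    # A left-to-right model trained on the rearranged sequence generates the
    # masked spans last, conditioned on context from both sides of the hole.
    return tokens + tail

# Example: one 3-token span is relocated to the end behind <mask:0>.
print(causal_mask_transform("the quick brown fox jumps over the lazy dog".split(),
                            n_spans=1, span_len=3, seed=0))
```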

Cited by 14 publications (28 citation statements)
References 27 publications
“…We compare the validation perplexity of the 6B parameter model and a smaller 1.3B parameter model (see Section 6.2 for details on the training of this 1.3B model) in Figure 2, showing comparable scaling laws to those reported by Aghajanyan et al. [2]. (Figure caption: HumanEval Pass@1; we plot a line of best fit along with a 95% confidence interval via bootstrap resampling.)…”
Section: Training
Confidence: 59%
“…On the other hand, masked language models can condition on both the left and right contexts to infill a masked region, however, their training objective is typically limited to generating only about 15% of a document. In this paper, we adopt the recently proposed causal masking objective [2], which aims to combine the strengths of both causal and masked language models.…”
Section: Infilling and Synthesis Via Causal Masking
Confidence: 99%
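
To make the contrast concrete, here is a minimal sketch of how a causally masked model can be prompted for infilling at inference time: the hole is marked in place with a sentinel, the same sentinel is appended at the end of the prompt, and the model generates the missing span left to right with both contexts visible. The `generate` callable, sentinel strings, and end-of-span marker are hypothetical names, not a specific library's API.

```python
def infill_prompt(left_context, right_context, sentinel="<mask:0>"):
    # The hole is marked in place, and the same sentinel is repeated at the
    # end of the prompt, where the model starts generating the missing span.
    return left_context + [sentinel] + right_context + [sentinel]

def infill(generate, left_context, right_context, eos="<eos>"):
    # `generate` is a hypothetical left-to-right decoder that yields tokens
    # one at a time given a prompt; any autoregressive sampler would do.
    completion = []
    for token in generate(infill_prompt(left_context, right_context)):
        if token == eos:
            break
        completion.append(token)
    return completion
```
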
“…Many previous works have trained GANs [21] on publicly available image captioning datasets to produce text-conditional image samples [56,63,49,58,57]. Other works have adapted the VQ-VAE approach [52] to text-conditional image generation by training autoregressive transformers on sequences of text tokens followed by image tokens [40,12,1]. Finally, some works have applied diffusion models to the problem, training either continuous [35] or discrete [22] diffusion models with auxiliary text encoders to handle textual input.…”
Section: Related Work
Confidence: 99%
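
The autoregressive VQ-VAE-style approach mentioned above boils down to a simple sequence layout: caption tokens first, then discrete image tokens from a learned image tokenizer, all modeled left to right by one decoder-only transformer. The sketch below only illustrates that layout; the boundary marker and token formats are illustrative assumptions, not any particular system's vocabulary.

```python
def build_sequence(text_tokens, image_codes, boundary="<image>"):
    # Caption text first, then a boundary marker, then the discrete codes
    # produced by a learned image tokenizer (e.g. a VQ-VAE-style quantizer).
    return list(text_tokens) + [boundary] + [f"<img:{c}>" for c in image_codes]

# A decoder-only transformer trained left-to-right on such sequences can be
# sampled with a caption as the prefix to produce the image tokens.
print(build_sequence(["a", "red", "cube"], [17, 932, 4]))
```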