2022
DOI: 10.48550/arxiv.2201.07520
Preprint

CM3: A Causal Masked Multimodal Model of the Internet

Abstract: We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of at their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative mode…
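
As a concrete illustration of the rearrangement the abstract describes, here is a minimal Python sketch: a small number of contiguous spans are cut out of the token string, replaced in place by sentinel tokens, and appended at the end so a left-to-right model generates them last. The function name, sentinel format, and span-selection policy are illustrative assumptions for a toy example, not the authors' implementation (CM3 masks long spans in large structured multi-modal documents).

```python
import random

def causal_mask_transform(tokens, n_spans=1, span_len=3, seed=None):
    """Move a few contiguous spans to the end of the sequence behind sentinels."""
    rng = random.Random(seed)
    tokens = list(tokens)
    tail = []
    for i in range(n_spans):
        if len(tokens) <= span_len:
            break
        start = rng.randrange(0, len(tokens) - span_len + 1)
        span = tokens[start:start + span_len]
        sentinel = f"<mask:{i}>"
        # Replace the span with a sentinel token at its original position ...
        tokens[start:start + span_len] = [sentinel]
        # ... and queue the span (prefixed by the same sentinel) for the end.
        tail += [sentinel] + span
    # A left-to-right model trained on the rearranged sequence generates the
    # masked spans last, conditioned on context from both sides of the hole.
    return tokens + tail

# Example: one 3-token span is relocated to the end behind <mask:0>.
print(causal_mask_transform("the quick brown fox jumps over the lazy dog".split(),
                            n_spans=1, span_len=3, seed=0))
```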

Cited by 14 publications (28 citation statements)
References 27 publications
“…We compare the validation perplexity of the 6B parameter model and a smaller 1.3B parameter model (see Section 6.2 for details on the training of this 1.3B model) in Figure 2, showing comparable scaling laws to those reported by Aghajanyan et al. [2]. (Figure caption: HumanEval Pass@1; we plot a line of best fit along with a 95% confidence interval via bootstrap resampling.)…”
Section: Training
Confidence: 59%
“…On the other hand, masked language models can condition on both the left and right contexts to infill a masked region, however, their training objective is typically limited to generating only about 15% of a document. In this paper, we adopt the recently proposed causal masking objective [2], which aims to combine the strengths of both causal and masked language models.…”
Section: Infilling and Synthesis Via Causal Masking
Confidence: 99%
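
To make the contrast concrete, here is a minimal sketch of how a causally masked model can be prompted for infilling at inference time: the hole is marked in place with a sentinel, the same sentinel is appended at the end of the prompt, and the model generates the missing span left to right with both contexts visible. The `generate` callable, sentinel strings, and end-of-span marker are hypothetical names, not a specific library's API.

```python
def infill_prompt(left_context, right_context, sentinel="<mask:0>"):
    # The hole is marked in place, and the same sentinel is repeated at the
    # end of the prompt, where the model starts generating the missing span.
    return left_context + [sentinel] + right_context + [sentinel]

def infill(generate, left_context, right_context, eos="<eos>"):
    # `generate` is a hypothetical left-to-right decoder that yields tokens
    # one at a time given a prompt; any autoregressive sampler would do.
    completion = []
    for token in generate(infill_prompt(left_context, right_context)):
        if token == eos:
            break
        completion.append(token)
    return completion
```
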
“…Many previous works have trained GANs [21] on publicly available image captioning datasets to produce text-conditional image samples [56,63,49,58,57]. Other works have adapted the VQ-VAE approach [52] to text-conditional image generation by training autoregressive transformers on sequences of text tokens followed by image tokens [40,12,1]. Finally, some works have applied diffusion models to the problem, training either continuous [35] or discrete [22] diffusion models with auxiliary text encoders to handle textual input.…”
Section: Related Work
Confidence: 99%
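
The autoregressive VQ-VAE-style approach mentioned above boils down to a simple sequence layout: caption tokens first, then discrete image tokens from a learned image tokenizer, all modeled left to right by one decoder-only transformer. The sketch below only illustrates that layout; the boundary marker and token formats are illustrative assumptions, not any particular system's vocabulary.

```python
def build_sequence(text_tokens, image_codes, boundary="<image>"):
    # Caption text first, then a boundary marker, then the discrete codes
    # produced by a learned image tokenizer (e.g. a VQ-VAE-style quantizer).
    return list(text_tokens) + [boundary] + [f"<img:{c}>" for c in image_codes]

# A decoder-only transformer trained left-to-right on such sequences can be
# sampled with a caption as the prefix to produce the image tokens.
print(build_sequence(["a", "red", "cube"], [17, 932, 4]))
```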