2022
DOI: 10.48550/arxiv.2210.11399
Preprint

Transcending Scaling Laws with 0.1% Extra Compute

Cited by 7 publications (6 citation statements)
References 0 publications
“…UL2-20B, when tested with FLAN instruction tuning, achieved a competitive score compared to FLAN-PaLM 62B on MMLU and Big-Bench benchmarks. After using the MoD objective, U-PaLM [76] achieved the same performance as PaLM-540B but with only half of its computational budget.…”
Section: Regular Denoising
confidence: 90%
“…One major improvement in the advancement of LLMs is using instruction tuning [25]. U-PaLM [26] significantly increases zero-shot performance of PaLM with only 0.1% extra compute, by applying the mixture of denoising training objective from UL2 [27] to a pretrained PaLM model. Flan-PaLM [28] further improves on that by using both instruction-tuning and chain-of-thought prompting.…”
Section: Related Work
confidence: 99%
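For context on the mixture-of-denoisers (MoD) objective referenced in the passage above: UL2 trains on a blend of denoising modes (regular, extreme, and sequential denoisers signalled by mode tokens), and U-PaLM continues training a pretrained PaLM model on this mixture for a small additional compute budget. The Python sketch below only illustrates how such denoising examples can be constructed at the data level; the mode names, sentinel tokens, corruption rates, and span lengths are placeholder assumptions, not the configuration used in the paper.

```python
import random

# Illustrative MoD-style example construction (assumed hyperparameters).
# Expected corrupted fraction per example is roughly corrupt_rate.
MODES = {
    "[R]": dict(span_len=3, corrupt_rate=0.15),    # regular denoising: short spans, low rate
    "[X]": dict(span_len=12, corrupt_rate=0.50),   # extreme denoising: long spans / high rate
    "[S]": dict(span_len=None, corrupt_rate=0.25), # sequential denoising: predict a suffix
}

def make_denoising_example(tokens, mode):
    cfg = MODES[mode]
    if mode == "[S]":
        # Sequential (prefix-LM style): keep a prefix, predict the remaining suffix.
        split = int(len(tokens) * (1 - cfg["corrupt_rate"]))
        inputs = [mode] + tokens[:split] + ["<extra_id_0>"]
        targets = ["<extra_id_0>"] + tokens[split:]
        return inputs, targets
    # Span corruption: replace random spans with sentinel tokens the model must fill in.
    inputs, targets, i, sid = [mode], [], 0, 0
    while i < len(tokens):
        if random.random() < cfg["corrupt_rate"] / cfg["span_len"]:
            span = tokens[i:i + cfg["span_len"]]
            sentinel = f"<extra_id_{sid}>"
            inputs.append(sentinel)
            targets += [sentinel] + span
            i += cfg["span_len"]
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

tokens = "scaling language models improves quality but costs compute".split()
mode = random.choice(list(MODES))  # sample one denoiser per example
print(mode, *make_denoising_example(tokens, mode), sep="\n")
```

Each training example is tagged with its mode token, so a single model learns all denoising variants from one data stream; this is the sense in which U-PaLM can reuse an existing causal checkpoint with only a small amount of extra compute.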
“…In this section, we delve deeper into the feasibility of simultaneously optimizing the two distinct objectives. Unlike existing unified pre-training frameworks [16,17,18,40,41] which employ analogous formulations to pre-train various objectives, we explore how to extend the conclusions drawn from a similar training format to a broader setting. Specifically, we investigate if a model optimized via in-place token predictions benefits the one trained via the next-token prediction regime, and vice versa.…”
Section: The Empirical Analysis Of Unified Training
confidence: 99%
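The passage above contrasts in-place token prediction (recovering the original tokens at corrupted positions) with next-token prediction (causal language modelling). As a rough illustration of how the two losses differ at the tensor level, here is a hedged PyTorch sketch; the embedding-plus-linear "model", mask rate, and mask id are stand-ins for illustration, not anything from the cited work.

```python
import torch
import torch.nn.functional as F

vocab, d = 100, 32
emb = torch.nn.Embedding(vocab, d)      # toy stand-in for a transformer body
head = torch.nn.Linear(d, vocab)        # output projection to vocabulary logits
ids = torch.randint(0, vocab, (1, 16))  # one toy sequence of token ids

# Next-token prediction: the representation at position t predicts token t+1.
h = emb(ids)
ntp_loss = F.cross_entropy(head(h[:, :-1]).reshape(-1, vocab),
                           ids[:, 1:].reshape(-1))

# In-place prediction: corrupt some positions, predict the original tokens there.
mask = torch.rand(ids.shape) < 0.15
mask[:, 0] = True                        # guarantee at least one masked position
corrupted = ids.masked_fill(mask, 0)     # token id 0 acts as a stand-in [MASK]
h = emb(corrupted)
inplace_loss = F.cross_entropy(head(h[mask]), ids[mask])

print(float(ntp_loss), float(inplace_loss))
```

The key structural difference the quoted analysis examines is visible here: the next-token loss is computed at every position against a shifted target, while the in-place loss is computed only at corrupted positions against the uncorrupted originals.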