2022
DOI: 10.48550/arXiv.2205.01068
Preprint

OPT: Open Pre-trained Transformer Language Models

Abstract: Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters…
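Since the abstract emphasizes that the model weights are openly available for study, a minimal sketch of how the released checkpoints can be inspected is shown below. It assumes the weights are mirrored on the Hugging Face Hub under the facebook/opt-* identifiers and that the transformers and torch packages are installed; neither assumption comes from the abstract itself.

```python
# Minimal sketch: load the smallest OPT checkpoint and generate a continuation.
# Assumes the weights are hosted on the Hugging Face Hub as "facebook/opt-125m"
# (an assumption about distribution, not stated in the abstract above).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # smallest model in the 125M-175B suite
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Large language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```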

Cited by 180 publications (236 citation statements)
References 24 publications
“…Model Architectures: We replicate publicly available references for Transformer language model architectures [53,54]. We use the 125 million, 355 million, 1.3 billion, 2.7 billion, 6.7 billion, and 13 billion model configurations (see § A.4 for more explicit architecture and hyperparameter configurations).…”
Section: Methods (mentioning, confidence: 99%)
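The quoted passage refers to per-size architecture and hyperparameter configurations. A sketch of that configuration table is given below; the layer, hidden-size, and head counts follow commonly reported OPT settings and should be checked against Table 1 of the paper rather than taken as authoritative, and the 355M size quoted above is listed here under the OPT name "350m".

```python
# Illustrative configuration table for the six OPT sizes named in the citing work.
# Values (decoder layers, hidden size, attention heads) follow commonly reported
# OPT settings; treat them as a sketch, not an authoritative reproduction.
OPT_CONFIGS = {
    "125m": {"layers": 12, "hidden": 768,  "heads": 12},
    "350m": {"layers": 24, "hidden": 1024, "heads": 16},  # cited as "355 million"
    "1.3b": {"layers": 24, "hidden": 2048, "heads": 32},
    "2.7b": {"layers": 32, "hidden": 2560, "heads": 32},
    "6.7b": {"layers": 32, "hidden": 4096, "heads": 32},
    "13b":  {"layers": 40, "hidden": 5120, "heads": 40},
}

def head_dim(size: str) -> int:
    """Per-head dimension implied by hidden size divided by number of heads."""
    cfg = OPT_CONFIGS[size]
    return cfg["hidden"] // cfg["heads"]

print(head_dim("1.3b"))  # 64
```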
“…In this section, we lay out the details of the experiments, although we pull most training details directly from publicly available references [53,54]. As such, we provide the details of model architectures in the same style as Table 1 of [53] for ease of comparison. All models use the GELU activation [74] for the nonlinearity.…”
Section: A.4 Model Training/Dataset Details (mentioning, confidence: 99%)
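The quoted passage notes that all models use GELU for the nonlinearity. Below is a minimal sketch of the transformer feed-forward sublayer this implies, written in PyTorch; the 4x hidden expansion is an assumed GPT-style default and the class name FeedForward is hypothetical, neither taken from the quoted text.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """GELU feed-forward sublayer of a decoder-only transformer block.

    The 4x hidden expansion is an assumed GPT-style default, not something
    specified in the quoted passage.
    """

    def __init__(self, hidden_size: int, expansion: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, expansion * hidden_size)
        self.act = nn.GELU()  # GELU nonlinearity, as noted in the citation
        self.fc2 = nn.Linear(expansion * hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

# Example: a 768-wide sublayer matching the 125M-scale hidden size
ff = FeedForward(hidden_size=768)
y = ff(torch.randn(2, 16, 768))  # (batch, sequence, hidden)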
“…Triggered by GPT-3, a plethora of other large language models, each a different variant of the transformer architecture [28], have been developed. Some of the most powerful are PaLM [4], GLaM [6], Megatron-Turing NLG [23], Meta-OPT [31], Gopher [21], LaMDA [27] and Chinchilla [9]. PaLM currently provides state-of-the-art performance on NLP tasks such as natural language translation, predicting long-range text dependencies, and even translation to structured representations [4].…”
Section: Introduction (mentioning, confidence: 99%)