2022
DOI: 10.1162/tacl_a_00492

Generate, Annotate, and Learn: NLP with Synthetic Text

Abstract: This paper studies the use of language models as a source of synthetic unlabeled text for NLP. We formulate a general framework called “generate, annotate, and learn (GAL)” to take advantage of synthetic text within knowledge distillation, self-training, and few-shot learning applications. To generate high-quality task-specific text, we either fine-tune LMs on inputs from the task of interest, or prompt large LMs with few examples. We use the best available classifier to annotate synthetic text with soft pseudo-labels…
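The abstract describes a three-step pipeline: generate task-specific text with a fine-tuned or prompted LM, annotate it with the best available classifier, and learn from the resulting pairs. Below is a minimal sketch of that loop, assuming Hugging Face pipelines; the model names, prompt, and sampling settings are illustrative stand-ins, not the authors' actual setup:

```python
# Hedged sketch of the GAL loop; GPT-2 and the SST-2 DistilBERT checkpoint
# are illustrative stand-ins, not the models used in the paper.
from transformers import pipeline

# 1) Generate: an LM fine-tuned on (or prompted with) task inputs produces
#    synthetic unlabeled text for the task domain.
generator = pipeline("text-generation", model="gpt2")
outputs = generator("The movie was", do_sample=True,
                    num_return_sequences=4, max_new_tokens=30)
synthetic_texts = [o["generated_text"] for o in outputs]

# 2) Annotate: the best available task classifier assigns soft pseudo-labels
#    (full class distributions rather than hard argmax labels).
teacher = pipeline("text-classification",
                   model="distilbert-base-uncased-finetuned-sst-2-english",
                   top_k=None)  # top_k=None returns scores for every class
soft_labels = teacher(synthetic_texts)

# 3) Learn: the (text, soft-label) pairs augment the real training set for
#    knowledge distillation or self-training of a student model.
for text, dist in zip(synthetic_texts, soft_labels):
    print(text[:60], {d["label"]: round(d["score"], 3) for d in dist})
```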

Cited by 11 publications (6 citation statements)
References 45 publications (51 reference statements)
“…To overcome the limitations of real-world data availability, NLP researchers have explored the use of synthetic datasets for several applications. For example, He et al. [42] utilized language models to generate synthetic unlabeled text. They introduced the Generate, Annotate, and Learn (GAL) framework, which leverages synthetic text for knowledge distillation, self-training, and few-shot learning.…”
Section: Synthetic Data Collection
Mentioning confidence: 99%
“…The quality of current synthetically generated text has encouraged researchers to explore its potential for complementing labor-intensive tasks, such as annotation and evaluation. For instance, He et al. (2022) generated synthetic unlabeled text tailored to a specific NLP task. They then used an existing supervised classifier to silver-annotate those sentences, aiming to establish a fully synthetic process for generating, annotating, and learning instances relevant to the target problem.…”
Section: Natural Language Annotation and Data Generation Using LLMs
Mentioning confidence: 99%
“…The application of temperature scaling during the distillation process facilitated the transfer of softer probability distributions from the teacher to the student, promoting a more nuanced learning process [38,39]. The introduction of intermediate layer matching aimed to align the representational spaces of the student and teacher models more closely, enhancing the fidelity of knowledge transfer [40][41][42]. Techniques such as progressive distillation, where the student model was iteratively refined using multiple teacher models, sought to aggregate diverse knowledge representations, further improving performance [43,44].…”
Section: Knowledge Distillation
Mentioning confidence: 99%
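The excerpt above attributes the transfer of softer probability distributions to temperature scaling. Here is a minimal sketch of that loss term, assuming PyTorch and the standard temperature-scaled distillation objective; the function name and hyperparameter values are illustrative assumptions, not the cited papers' exact settings:

```python
# Hedged sketch of temperature-scaled knowledge distillation in PyTorch.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Dividing logits by T > 1 flattens the distributions, exposing the
    # teacher's relative preferences among non-target classes.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # The KL term is scaled by T^2 so its gradient magnitude stays
    # comparable to the hard-label term as T changes.
    kd = F.kl_div(student_log_probs, soft_targets,
                  reduction="batchmean") * (T ** 2)
    ce = F.cross_entropy(student_logits, labels)  # hard-label supervision
    return alpha * kd + (1 - alpha) * ce

# Toy usage: a batch of 3 examples over 2 classes.
student = torch.randn(3, 2, requires_grad=True)
teacher = torch.randn(3, 2)
gold = torch.tensor([0, 1, 1])
print(distillation_loss(student, teacher, gold))
```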