2022
DOI: 10.48550/arxiv.2201.05955
Preprint
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation

Abstract: A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We introduce a novel paradigm for dataset creation based on human and machine collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI, our approach uses dataset cartography to automatically identify examples that demonstrate…
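The dataset-cartography step mentioned in the abstract can be sketched in a few lines. This is a minimal illustration, not code from the WANLI release: it assumes you have recorded the model's probability of the gold label for each training example after each epoch, and it flags the high-variability ("ambiguous") examples that cartography treats as the most informative. The function names are hypothetical.

```python
import numpy as np

def cartography_stats(epoch_probs):
    """epoch_probs: shape (n_epochs, n_examples) — the model's probability
    of the gold label for each example, recorded after each epoch."""
    probs = np.asarray(epoch_probs, dtype=float)
    confidence = probs.mean(axis=0)   # mean gold-label probability over training
    variability = probs.std(axis=0)   # spread of that probability across epochs
    return confidence, variability

def select_ambiguous(epoch_probs, top_k):
    """Return the indices of the top_k most ambiguous (highest-variability)
    examples — the region of the data map cartography highlights."""
    _, variability = cartography_stats(epoch_probs)
    return np.argsort(-variability)[:top_k]
```

For example, an item the model scores at 0.5, 0.1, then 0.9 across three epochs has far higher variability than one pinned at 0.9 throughout, so `select_ambiguous` would surface it first.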

Cited by 16 publications (27 citation statements)
References 16 publications
“…It is therefore tempting to combine active learning with the use of a foundation model, to improve sample efficiency beyond what either method can do alone. While prior work has attempted to leverage language models to automate part of the labeling process [48,25], we are not aware of any work fine-tuning language models that has tried actively selecting points for human feedback. We seek to fill this gap in the literature in the remainder of the paper.…”
Section: Related Work
confidence: 99%
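The point-selection step described in the excerpt above — actively choosing which examples to send for human feedback — is commonly implemented with an uncertainty heuristic. The sketch below is a generic illustration, not from any of the cited papers: it assumes the model exposes per-class probabilities for a pool of unlabeled examples and picks the highest-entropy ones.

```python
import numpy as np

def entropy_select(probas, batch_size):
    """Uncertainty sampling: return the indices of the batch_size unlabeled
    examples whose predicted class distribution has the highest entropy."""
    p = np.clip(np.asarray(probas, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return np.argsort(-entropy)[:batch_size]
```

A near-uniform prediction like (0.5, 0.5) is selected before a confident one like (0.99, 0.01), so annotator effort concentrates where the model is least sure.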
“…The benchmark comprises 115,530 sentence pairs, which include 8,421 idioms. A recurring challenge in crowdsourcing NLP-oriented datasets at scale is that human writers frequently rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity [11]. A new large-scale CIP dataset is created in this study by taking advantage of the collaboration between humans and machines.…”
Section: Table
confidence: 99%
“…The out-of-domain test set is collected separately by native Chinese crowd-workers without human–machine collaboration. Crowd-workers often adopt a limited set of writing strategies to speed up dataset construction, which harms the diversity of the dataset [11], [29]. The quality of the out-of-domain test set can be further improved.…”
Section: Human Evaluation
confidence: 99%
“…Scalable Oversight As AI systems become more capable of generating candidate responses, an emerging line of research supervises AI systems by providing preferences over AI-generated candidates rather than providing human demonstrations (Stiennon et al, 2020;Wiegreffe et al, 2021;Askell et al, 2021;Liu et al, 2022;Ouyang et al, 2022). Therefore, to supervise AI to perform more complex tasks, it becomes increasingly important to determine human preferences over model outputs that are expensive to verify, such as full-book summaries or natural language descriptions of distributional properties (Amodei et al, 2016;Wu et al, 2021;Zhong et al, 2022).…”
Section: Related Work
confidence: 99%