2021
DOI: 10.48550/arxiv.2109.11377
WRENCH: A Comprehensive Benchmark for Weak Supervision

Jieyu Zhang,
Yue Yu,
Yinghao Li
et al.

Abstract: Recent Weak Supervision (WS) approaches have had widespread success in easing the bottleneck of labeling training data for machine learning by synthesizing labels from multiple potentially noisy supervision sources. However, proper measurement and analysis of these approaches remain a challenge. First, datasets used in existing works are often private and/or custom, limiting standardization. Second, WS datasets with the same name and base data often vary in terms of the labels and weak supervision sources used…

Cited by 10 publications (15 citation statements)
References 56 publications

“…We evaluate our framework on nine benchmark NLP classification datasets that are popular in the few-shot learning and weak supervision literature (Ratner et al., 2017; Awasthi et al., 2020; Zhang et al., 2021a; Cohan et al., 2019). These tasks are as follows: AGNews: using news headlines to predict article topic; CDR: using scientific paper excerpts to predict whether drugs induce diseases; ChemProt: using paper excerpts to predict the functional relationship between chemicals and proteins; IMDB: movie review sentiment; SciCite: classifying citation intent in Computer Science papers; SemEval: relation classification from web text; SMS: text message spam detection; TREC: conversational question intent classification; YouTube: internet comment spam detection.…”
Section: Methods
confidence: 99%
“…We ran all experiments on Microsoft Azure cloud compute using NVIDIA V100 GPUs (32 GB VRAM). All algorithms were implemented using the PyTorch and WRENCH frameworks (Paszke et al., 2017; Zhang et al., 2021a). We report binary F1 score for binary classification tasks and macro-weighted F1 for multiclass classification tasks.…”
Section: Methods
confidence: 99%
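The excerpt above reports binary F1 for binary tasks and a macro-style F1 for multiclass tasks. As a rough sketch (not the cited authors' code, and assuming "macro-weighted" means an unweighted average of per-class F1 scores), the macro variant can be computed as:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1(y_true, y_pred, labels) -> float:
    """Average the one-vs-rest F1 score over all classes."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)
```

For a binary task, the binary F1 is simply the single-class F1 for the positive label; libraries such as scikit-learn expose both via the `average` argument of `f1_score`.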
“…Also, while Snorkel methods are data-free and use only the weak signals to estimate the labels of the data, our method is data-dependent and uses features of the data to make the generated labels consistent with the data. Concurrent to our work, a new weak supervision benchmark has been developed (Zhang et al., 2021).…”
Section: Related Work
confidence: 99%
“…For image classification tasks, we follow Mazzetto et al. (2021b;a) to train a branch of image classifiers as supervision sources of seen classes. For text classification tasks, we built keyword-based labeling functions as supervision sources of seen classes following Zhang et al. (2021); each labeling function returns its associated label when a certain keyword exists in the text, and otherwise abstains. Notably, all the involved supervision sources are "weak" because they cannot predict the desired unseen classes.…”
Section: Setup
confidence: 99%
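The keyword-based labeling functions described above can be sketched as follows. This is an illustrative assumption, not code from the cited papers: the `ABSTAIN = -1` convention mirrors common weak-supervision frameworks, and the spam keyword is a hypothetical example.

```python
ABSTAIN = -1  # conventional "no vote" value in weak-supervision frameworks

def make_keyword_lf(keyword: str, label: int):
    """Build a labeling function that votes `label` when `keyword` occurs, else abstains."""
    def lf(text: str) -> int:
        return label if keyword in text.lower() else ABSTAIN
    return lf

# Hypothetical keyword for a spam-detection task (SPAM = 1, HAM = 0)
spam_lf = make_keyword_lf("free", 1)
```

A label model then aggregates the (possibly conflicting, possibly abstaining) votes of many such functions into a single training label per example.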
“…We use a pre-trained sentence transformer (Reimers & Gurevych, 2019) to obtain document embeddings for classification. We follow Zhang et al (2021) to generate 5 keyword-based labeling functions for each seen label as ILFs.…”
Section: F Experimental Details
confidence: 99%