Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.482
BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition

Abstract: We study the problem of learning a named entity recognition (NER) tagger using noisy labels from multiple weak supervision sources. Though cheap to obtain, the labels from weak supervision sources are often incomplete, inaccurate, and contradictory, making it difficult to learn an accurate NER model. To address this challenge, we propose a conditional hidden Markov model (CHMM), which can effectively infer true labels from multi-source noisy labels in an unsupervised way. CHMM enhances the classic hidden Markov model…
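To make the abstract's core idea concrete, here is a minimal sketch of the CHMM mechanism: token-wise transition and emission probabilities predicted from contextual embeddings, with multiple noisy label sources scored jointly by a forward pass. This is not the authors' code; the random embeddings (standing in for BERT output), the untrained linear maps, the tag/source counts, and the random observations are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5    # number of hidden NER tags (e.g., O, B-PER, I-PER, B-ORG, I-ORG)
D = 16   # embedding size (BERT gives 768; 16 keeps the demo small)
S = 3    # number of weak supervision sources
T = 4    # sentence length

# Hypothetical per-token contextual embeddings standing in for BERT output.
emb = rng.normal(size=(T, D))

# Untrained linear maps from embeddings to token-wise transition/emission logits.
W_trans = 0.1 * rng.normal(size=(D, K * K))
W_emit = 0.1 * rng.normal(size=(D, K * K))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# A[t, i, j] = P(y_t = j | y_{t-1} = i, x_t): transitions conditioned on the token.
A = softmax((emb @ W_trans).reshape(T, K, K), axis=-1)
# B[t, i, k] = P(a source emits label k | y_t = i, x_t): token-wise emissions.
B = softmax((emb @ W_emit).reshape(T, K, K), axis=-1)

# Noisy labels from each source (random stand-ins for rule/dictionary output).
obs = rng.integers(0, K, size=(S, T))

def emission_lik(t):
    # Sources are treated as conditionally independent given the true tag,
    # so their emission likelihoods multiply.
    lik = np.ones(K)
    for s in range(S):
        lik *= B[t, :, obs[s, t]]
    return lik

# Forward pass: marginal likelihood of the multi-source observations.
alpha = np.full(K, 1.0 / K) * emission_lik(0)
for t in range(1, T):
    alpha = (alpha @ A[t]) * emission_lik(t)
print("P(observations) =", alpha.sum())
```

In a trained model the two linear maps would be fit by maximizing this marginal likelihood, and posterior decoding over the hidden tags would yield the denoised labels.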

Cited by 15 publications (29 citation statements) | References 35 publications
“…Most studies focus on developing label models while leaving the end model flexible to the downstream tasks. Existing label models include Majority Voting (MV), Probabilistic Graphical Models (PGM) [14,77,75,22,53,82,50], etc. Note that prior crowd-worker modeling work can be included and subsumed by this set of approaches, e.g.…”
Section: Two-stage Methods
Mentioning confidence: 99%
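Of the label models named in this statement, Majority Voting is the simplest baseline: each token takes the most frequent label across the weak sources. A small sketch, with an illustrative tag set and made-up votes:

```python
from collections import Counter

# rows = weak supervision sources, columns = tokens (hypothetical votes)
votes = [
    ["B-PER", "O",     "B-ORG", "O"],
    ["B-PER", "O",     "O",     "O"],
    ["O",     "B-LOC", "B-ORG", "O"],
]

def majority_vote(votes):
    n_tokens = len(votes[0])
    out = []
    for t in range(n_tokens):
        column = [source[t] for source in votes]
        label, _ = Counter(column).most_common(1)[0]
        out.append(label)
    return out

print(majority_vote(votes))  # ['B-PER', 'O', 'B-ORG', 'O']
```

Unlike the probabilistic graphical models cited above, majority voting ignores source reliability and label dependencies, which is why the PGM family generally outperforms it on noisy multi-source data.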
“…[53,71] use a standard HMM with multiple observed variables, each from one labeling source. [82] improves the HMM by introducing unique linking rules as an additional supervision source; [50] predicts token-wise transition and emission probabilities from BERT embeddings to utilize the context information. In addition, [45] is a one-stage method that models each labeling source with a CRF layer and aggregates their transitions with an attention network.…”
Section: Related Work
Mentioning confidence: 99%
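The standard multi-source HMM this statement attributes to [53,71] differs from the conditional variant sketched earlier in that its transitions are static and each source gets its own emission ("confusion") matrix. A minimal sketch with forward-backward posterior decoding; all parameters here are random stand-ins for values that would normally be learned with EM:

```python
import numpy as np

rng = np.random.default_rng(1)
K, S, T = 3, 2, 5  # tag count, number of sources, sentence length

pi = np.full(K, 1.0 / K)                 # initial tag distribution
A = rng.dirichlet(np.ones(K), size=K)    # static transitions P(y_t | y_{t-1})
# One confusion matrix per source: E[s, i, k] = P(source s outputs k | true tag i).
E = rng.dirichlet(np.ones(K), size=(S, K))

obs = rng.integers(0, K, size=(S, T))    # noisy labels from each source

def lik(t):
    # Product over sources (conditional independence given the true tag).
    l = np.ones(K)
    for s in range(S):
        l *= E[s, :, obs[s, t]]
    return l

# Forward-backward over the hidden true-tag chain.
alpha = np.zeros((T, K))
beta = np.zeros((T, K))
alpha[0] = pi * lik(0)
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * lik(t)
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (lik(t + 1) * beta[t + 1])

# Posterior marginals over true tags; argmax gives the aggregated labels.
post = alpha * beta
post /= post.sum(axis=1, keepdims=True)
print("inferred tags:", post.argmax(axis=1))
```

The BERT-conditioned variant in [50] (the paper under review) replaces the static `A` and `E` with per-token matrices predicted from contextual embeddings, which is what lets the model exploit sentence context during aggregation.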