Snorkel
2017 · DOI: 10.14778/3157794.3157797

Abstract: Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming.
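To make the labeling-function idea concrete, here is a minimal, self-contained Python sketch. The function names, the ABSTAIN convention, and the toy spam task are illustrative assumptions, not the actual Snorkel API; the majority vote at the end is also a deliberate simplification, since Snorkel replaces it with a learned generative model that estimates each function's accuracy without ground truth.

```python
from collections import Counter

ABSTAIN = None  # assumed convention for this sketch: a labeling function may decline to vote

# Each labeling function encodes one heuristic; its accuracy and its
# correlation with the other functions are unknown a priori.
def lf_spam_keyword(text):
    return "SPAM" if "free money" in text.lower() else ABSTAIN

def lf_long_message(text):
    return "HAM" if len(text.split()) > 20 else ABSTAIN

def lf_many_exclamations(text):
    return "SPAM" if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_spam_keyword, lf_long_message, lf_many_exclamations]

def weak_label(text):
    """Combine the noisy votes by simple majority. Snorkel instead learns
    per-function accuracies and produces probabilistic training labels."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(weak_label("Free money!!! Click now!!!"))  # -> SPAM
```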

Cited by 457 publications (68 citation statements)
References 42 publications

Citation statements (ordered by relevance):
“…However, our results hint that periphery nodes could also be noisy sources of information and possibly warrant omission in standard link prediction. Our fringe measurements can also be viewed as adding noisy training data, which is related to training data augmentation methods [29,30].…”
Section: Discussion · Citation type: mentioning · Confidence: 99%
“…Some of them use filtering to identify potentially mislabeled examples in the training dataset. This kind of filter is usually based on the labels of close neighbours (similar instances) [16] or exploits the disagreements in the predictions of classifiers trained on different portions of the dataset [15,17].…”
Section: Classification Under Weak Supervision · Citation type: mentioning · Confidence: 99%
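As an illustration of the neighbour-based filtering this statement describes, here is a small Python sketch that flags a training example as potentially mislabeled when its label disagrees with the majority label of its k nearest neighbours. The dataset, the value of k, and the plain Euclidean distance are made-up simplifications; the methods cited as [15–17] are considerably more elaborate.

```python
import numpy as np

def flag_suspect_labels(X, y, k=5):
    """Flag examples whose label disagrees with the majority label of
    their k nearest neighbours (a simplified stand-in for [16])."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    suspects = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                    # exclude the point itself
        neighbours = np.argsort(dists)[:k]   # indices of the k closest points
        majority = np.bincount(y[neighbours]).argmax()
        if majority != y[i]:
            suspects.append(i)
    return suspects

# Tiny synthetic example: two clusters, one deliberately flipped label.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
y = [0, 0, 1, 1, 1, 1]  # the third point is mislabeled relative to its cluster
print(flag_suspect_labels(X, y, k=2))  # -> [2]
```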
“…In [19], Ratner et al. transform a set of weak supervision sources, which may disagree with each other, into soft labels used to train a discriminative model. They show experimentally that this approach outperforms the naïve majority-voting strategy for generating the target labels.…”
Section: Related Work · Citation type: mentioning · Confidence: 99%
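For intuition about why learned soft labels can beat majority voting, here is a hedged Python sketch: given per-source accuracy estimates (which Snorkel's generative model learns without ground truth; here they are simply assumed), a log-odds-weighted vote yields a probabilistic label. The vote vector and accuracy values are invented for illustration, and the conditional-independence assumption behind the weighting is stated in the comments.

```python
import math

def soft_label(votes, accuracies):
    """Combine binary votes (+1/-1, 0 = abstain) into P(y = +1 | votes).

    Each source's vote is weighted by the log-odds of its estimated accuracy,
    the standard weighted-vote form under a conditional-independence
    assumption. Majority voting is the special case of equal weights.
    """
    score = sum(v * math.log(acc / (1.0 - acc))
                for v, acc in zip(votes, accuracies) if v != 0)
    return 1.0 / (1.0 + math.exp(-score))

votes = [+1, -1, -1]           # three weak sources disagreeing on one example
accuracies = [0.95, 0.55, 0.55]  # assumed per-source accuracies (Snorkel estimates these)

# Majority vote says -1 (2 votes to 1), but the soft label comes out near 0.93
# in favour of +1: the one highly accurate source outweighs two near-random ones.
print(round(soft_label(votes, accuracies), 3))  # -> 0.927
```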