Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL '09), 2009
DOI: 10.3115/1609067.1609135

Analysing Wikipedia and gold-standard corpora for NER training

Abstract: Named entity recognition (NER) for English typically involves one of three gold standards: MUC, CoNLL, or BBN, all created by costly manual annotation. Recent work has used Wikipedia to automatically create a massive corpus of named entity annotated text. We present the first comprehensive cross-corpus evaluation of NER. We identify the causes of poor cross-corpus performance and demonstrate ways of making them more compatible. Using our process, we develop a Wikipedia corpus which outperforms gold-standard corpora…
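The abstract only sketches how the Wikipedia corpus is built. As a rough illustration of the general idea of turning Wikipedia's internal links into silver-standard NER annotations, the Python sketch below tags linked spans with BIO labels drawn from a hypothetical article-to-type mapping. The ARTICLE_TYPES dict and the annotate helper are assumptions for illustration only, not the paper's actual pipeline, which classifies Wikipedia articles and handles tokenisation, sentence selection, and label inference far more carefully.

```python
import re

# Hypothetical mapping from Wikipedia article titles to coarse NE types.
# In the actual work the types come from classifying Wikipedia articles;
# this hard-coded dict exists purely for illustration.
ARTICLE_TYPES = {
    "Barack Obama": "PER",
    "White House": "ORG",
    "Baghdad": "LOC",
}

# Matches [[Target]] and [[Target|anchor text]] wiki links.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def annotate(wikitext):
    """Turn wiki-linked text into (token, BIO tag) pairs.

    Linked spans whose target article has a known entity type are tagged
    B-/I-<type>; all other tokens are tagged O.
    """
    pairs = []
    pos = 0
    for m in LINK_RE.finditer(wikitext):
        # Plain text before the link is left unannotated.
        pairs += [(tok, "O") for tok in wikitext[pos:m.start()].split()]
        target, anchor = m.group(1), m.group(2) or m.group(1)
        etype = ARTICLE_TYPES.get(target)
        for i, tok in enumerate(anchor.split()):
            tag = "O" if etype is None else ("B-" if i == 0 else "I-") + etype
            pairs.append((tok, tag))
        pos = m.end()
    pairs += [(tok, "O") for tok in wikitext[pos:].split()]
    return pairs

print(annotate("[[Barack Obama]] visited the [[White House]] today ."))
```

Running the example tags "Barack Obama" as B-PER/I-PER and "White House" as B-ORG/I-ORG, while unlinked tokens stay O.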

Cited by 31 publications (29 citation statements). References 13 publications.
“…There is a handful of works aiming to pave the road towards zero-shot typing by addressing ways to extract cheap signals, often to help the supervised algorithms: e.g., by generating gazetteers (Nadeau et al., 2006), or using the anchor texts in Wikipedia (Nothman et al., 2008, 2009). Ren et al. (2016) project labels in high-dimensional space and use label correlations to suppress noise and better model their relations.…”
Section: Related Work (mentioning)
confidence: 99%
“…Cases like White House being classified as a location rather than an organization are a common confusion (Nothman et al., 2009). Similarly, Rothko can be considered a person or product entity.…”
Section: Fine-grained Dutch Named Entity Recognition (mentioning)
confidence: 99%
“…The diversity in text types, which was lacking in the Dutch CoNLL-2002 dataset, should allow for a more robust classifier and better cross-corpus performance (Nothman et al., 2009). It should also make SoNaR 1 an interesting corpus for research on domain adaptation.…”
Section: Dataset (mentioning)
confidence: 99%
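Several of the citing statements above refer to cross-corpus performance, i.e. training a tagger on one annotated corpus and testing it on another. The sketch below shows that train/test matrix on toy data using a most-frequent-tag baseline and token-level accuracy; the corpus contents, the baseline model, and the metric are all placeholder assumptions (the paper itself compares trained NER models with entity-level scores across MUC, CoNLL, BBN, and the Wikipedia-derived corpus).

```python
from collections import Counter, defaultdict
from itertools import product

# Toy sentences standing in for corpora such as CoNLL, MUC, BBN, or the
# Wikipedia-derived corpus; each is a list of (token, tag) sequences.
CORPORA = {
    "conll": [[("U.N.", "B-ORG"), ("official", "O"), ("visits", "O"), ("Baghdad", "B-LOC")]],
    "wiki":  [[("The", "O"), ("U.N.", "B-ORG"), ("met", "O"), ("in", "O"), ("Baghdad", "B-LOC")]],
}

def train_baseline(sentences):
    """Most-frequent-tag-per-token baseline (stand-in for a real NER model)."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for tok, tag in sent:
            counts[tok][tag] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def accuracy(model, sentences):
    """Token-level tagging accuracy (a real study would report entity F1)."""
    pairs = [(tok, tag) for sent in sentences for tok, tag in sent]
    correct = sum(model.get(tok, "O") == tag for tok, tag in pairs)
    return correct / len(pairs)

# Cross-corpus matrix: train on each corpus, evaluate on every corpus.
for train_name, test_name in product(CORPORA, repeat=2):
    model = train_baseline(CORPORA[train_name])
    print(f"train={train_name:5s} test={test_name:5s} "
          f"acc={accuracy(model, CORPORA[test_name]):.2f}")
```

The off-diagonal cells of such a matrix are what the paper means by cross-corpus evaluation: performance typically drops there, and the work above studies why and how to reduce the drop.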
“…In this sense, we must mention that there exists recent interesting work using Wikipedia as a gold-standard corpus to train supervised NEC classifiers [19].…”
Section: Related Work (mentioning)
confidence: 99%