Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL '09), 2009
DOI: 10.3115/1609067.1609135

Analysing Wikipedia and gold-standard corpora for NER training

Abstract: Named entity recognition (NER) for English typically involves one of three gold standards: MUC, CoNLL, or BBN, all created by costly manual annotation. Recent work has used Wikipedia to automatically create a massive corpus of named entity annotated text. We present the first comprehensive cross-corpus evaluation of NER. We identify the causes of poor cross-corpus performance and demonstrate ways of making them more compatible. Using our process, we develop a Wikipedia corpus which outperforms gold-standard corpora…
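The abstract only sketches how the Wikipedia corpus is built. As a rough illustration of the general idea of turning Wikipedia's internal links into silver-standard NER annotations, the Python sketch below tags linked spans with BIO labels drawn from a hypothetical article-to-type mapping. The ARTICLE_TYPES dict and the annotate helper are assumptions for illustration only, not the paper's actual pipeline, which classifies Wikipedia articles and handles tokenisation, sentence selection, and label inference far more carefully.

```python
import re

# Hypothetical mapping from Wikipedia article titles to coarse NE types.
# In the actual work the types come from classifying Wikipedia articles;
# this hard-coded dict exists purely for illustration.
ARTICLE_TYPES = {
    "Barack Obama": "PER",
    "White House": "ORG",
    "Baghdad": "LOC",
}

# Matches [[Target]] and [[Target|anchor text]] wiki links.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def annotate(wikitext):
    """Turn wiki-linked text into (token, BIO tag) pairs.

    Linked spans whose target article has a known entity type are tagged
    B-/I-<type>; all other tokens are tagged O.
    """
    pairs = []
    pos = 0
    for m in LINK_RE.finditer(wikitext):
        # Plain text before the link is left unannotated.
        pairs += [(tok, "O") for tok in wikitext[pos:m.start()].split()]
        target, anchor = m.group(1), m.group(2) or m.group(1)
        etype = ARTICLE_TYPES.get(target)
        for i, tok in enumerate(anchor.split()):
            tag = "O" if etype is None else ("B-" if i == 0 else "I-") + etype
            pairs.append((tok, tag))
        pos = m.end()
    pairs += [(tok, "O") for tok in wikitext[pos:].split()]
    return pairs

print(annotate("[[Barack Obama]] visited the [[White House]] today ."))
```

Running the example tags "Barack Obama" as B-PER/I-PER and "White House" as B-ORG/I-ORG, while unlinked tokens stay O.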

Cited by 31 publications (29 citation statements). References 13 publications.
“…There is a handful of works aiming to pave the road towards zero-shot typing by addressing ways to extract cheap signals, often to help the supervised algorithms: e.g., by generating gazetteers (Nadeau et al., 2006), or using the anchor texts in Wikipedia (Nothman et al., 2008, 2009). Ren et al. (2016) project labels in high-dimensional space and use label correlations to suppress noise and better model their relations.…”
Section: Related Work (mentioning)
confidence: 99%
“…Cases like White House being classified as a location rather than an organization are a common confusion (Nothman et al., 2009). Similarly, Rothko can be considered a person or product entity.…”
Section: Fine-grained Dutch Named Entity Recognition (mentioning)
confidence: 99%
“…The diversity in text types, which was lacking in the Dutch CoNLL-2002 dataset, should allow for a more robust classifier and better cross-corpus performance (Nothman et al., 2009). It should also make SoNaR 1 an interesting corpus for research on domain adaptation.…”
Section: Dataset (mentioning)
confidence: 99%
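Several of the citing statements above refer to cross-corpus performance, i.e. training a tagger on one annotated corpus and testing it on another. The sketch below shows that train/test matrix on toy data using a most-frequent-tag baseline and token-level accuracy; the corpus contents, the baseline model, and the metric are all placeholder assumptions (the paper itself compares trained NER models with entity-level scores across MUC, CoNLL, BBN, and the Wikipedia-derived corpus).

```python
from collections import Counter, defaultdict
from itertools import product

# Toy sentences standing in for corpora such as CoNLL, MUC, BBN, or the
# Wikipedia-derived corpus; each is a list of (token, tag) sequences.
CORPORA = {
    "conll": [[("U.N.", "B-ORG"), ("official", "O"), ("visits", "O"), ("Baghdad", "B-LOC")]],
    "wiki":  [[("The", "O"), ("U.N.", "B-ORG"), ("met", "O"), ("in", "O"), ("Baghdad", "B-LOC")]],
}

def train_baseline(sentences):
    """Most-frequent-tag-per-token baseline (stand-in for a real NER model)."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for tok, tag in sent:
            counts[tok][tag] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def accuracy(model, sentences):
    """Token-level tagging accuracy (a real study would report entity F1)."""
    pairs = [(tok, tag) for sent in sentences for tok, tag in sent]
    correct = sum(model.get(tok, "O") == tag for tok, tag in pairs)
    return correct / len(pairs)

# Cross-corpus matrix: train on each corpus, evaluate on every corpus.
for train_name, test_name in product(CORPORA, repeat=2):
    model = train_baseline(CORPORA[train_name])
    print(f"train={train_name:5s} test={test_name:5s} "
          f"acc={accuracy(model, CORPORA[test_name]):.2f}")
```

The off-diagonal cells of such a matrix are what the paper means by cross-corpus evaluation: performance typically drops there, and the work above studies why and how to reduce the drop.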
“…In this sense, we must mention that there exists recent interesting work using Wikipedia as a gold-standard corpus to train supervised NEC classifiers [19].…”
Section: Related Work (mentioning)
confidence: 99%