Proceedings of the Second International Conference on Human Language Technology Research - 2002
DOI: 10.3115/1289189.1289260

The GENIA corpus

Abstract: With the information overload in genome-related fields, there is an increasing need for natural language processing (NLP) technology to extract information from the literature, and various attempts at information extraction using NLP have been made. We are developing the necessary resources, including a domain ontology and an annotated corpus built from research abstracts in the MEDLINE database (the GENIA corpus). We are building the ontology and the corpus simultaneously, each informing the other. In this paper we report on our new corpus, its …

Cited by 100 publications (18 citation statements)
References: 5 publications
“…GENIA. GENIA (Ohta et al, 2002) is an English nested NER dataset in the molecular biology domain containing five entity types (e.g., DNA and RNA). More details (including entity types, sentence number, and examples) of three nested NER datasets ACE2004, ACE2005, and GENIA can be found in Appendix A.2.…”
Section: Results On Nested NER (mentioning)
confidence: 99%
“…'47 kDa sterol regulatory element binding factor'. In GENIA V3.0 (Ohta et al, 2002), we find that 18.6% of biomedical entity names consist of at least four words, as shown in Figure 1.…”
Section: Introduction (mentioning)
confidence: 90%
“…All of our experiments are done on GENIA corpus, which is the largest annotated corpus in the molecular biology domain available to public (Ohta et al, 2002). In our experiments, three versions are used.…”
Section: GENIA Corpus (mentioning)
confidence: 99%
“…Datasets We evaluate our method on English and Chinese NER datasets. English datasets include the general domain flat NER CoNLL2003 (Tjong Kim Sang and De Meulder, 2003), the nested NER ACE2005 (Kirkpatrick, 2010), and the biomedical nested NER GENIA (Ohta et al, 2002). Chinese datasets include four commonly used general domain flat NER benchmarks Resume (Zhang and Yang, 2018), Weibo (Peng and Dredze, 2015), MSRA (Levow, 2006) Inference setup For all generative models, we use greedy search with a beam size of 1, a maximum of 512 new tokens, and a temperature of 1.0.…”
Section: Setup (mentioning)
confidence: 99%