The Colorado Richly Annotated Full Text (CRAFT) Corpus: Multi-Model Annotation in the Biomedical Domain

Cohen, K. Bretonnel; Verspoor, Karin; Fort, Karën; Funk, Christopher; Bada, Michael; Palmer, Martha; Hunter, Lawrence

doi:10.1007/978-94-024-0881-2_53

Cited by 30 publications

(18 citation statements)

References 60 publications

(21 reference statements)

“…The CRAFT corpus (Bada et al., 2012; Cohen et al., 2017) is a collection of 97 full-text articles, of which 30 have been released only in the course of the present shared task. The documents were manually annotated with respect to 10 different entity types, linked to 8 manually curated ontologies of biomedical terminology: In addition, the annotations are distributed in an extended variant, i.e. CHEBI_EXT, CL_EXT etc., resulting in a total of 20 annotation sets.…”

Section: Data (mentioning)

confidence: 99%

UZH@CRAFT-ST: a Sequence-labeling Approach to Concept Recognition

Furrer, Cornelius, Rinaldi, 2019

Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

As our submission to the CRAFT shared task 2019, we present two neural approaches to concept recognition. We propose two different systems for joint named entity recognition (NER) and normalization (NEN), both of which model the task as a sequence labeling problem. Our first system is a BiLSTM network with two separate outputs for NER and NEN trained from scratch, whereas the second system is an instance of BioBERT fine-tuned on the concept-recognition task. We exploit two strategies for extending concept coverage, ontology pretraining and backoff with a dictionary lookup. Our results show that the backoff strategy effectively tackles the problem of unseen concepts, addressing a major limitation of the chosen design. In the cross-system comparison, BioBERT proves to be a strong basis for creating a concept-recognition system, although some entity types are predicted more accurately by the BiLSTM-based system.

“…The contents of the CRAFT corpus have been described extensively elsewhere [77–80]. We focus here on descriptive statistics that are specifically relevant to the coreference annotation.…”

Section: Methods (mentioning)

confidence: 99%

Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles

Cohen, Lanfranchi, Choi, et al. 2017

BMC Bioinformatics

Self Cite

Background: Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations.

Results: The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric that is used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached F of 0.46. Following the IDENTITY chains in the data would add 106,263 additional named entities in the full 97-paper corpus, for an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus.

Conclusions: The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. The work raised issues in the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain due to their referents to specific classes in domain-specific ontologies. The comparison of the performance of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results that are consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain, and also suggest that the baseline performance difference is quite large.

“… [20] Our concept normalization system is ConceptMapper, a high-performance customizable dictionary look-up tool implemented as a UIMA component [23]. Funk et al. determined that ConceptMapper is the best performing (highest F1 measure) concept recognition software as compared to others.…”

Section: Methods (mentioning)

confidence: 99%

“…Here, we use the Colorado Richly Annotated Full Text Corpus (CRAFT) of full text biomedical journal articles, annotated with concepts from eight different ontologies [20]. As a baseline concept normalization system, we used the best performing systems from Funk et al. [21], with the precision maximizing parameters for each ontology. For each ontology, we tested for a Zipfian distribution, identified the most common concept errors in PubMed Central Open Access, and tested a set of five different potential pre- and post-processing steps that could improve precision.…”

Section: Introduction (mentioning)

confidence: 99%

Improving precision in concept normalization

et al. 2017

Self Cite

Most natural language processing applications exhibit a trade-off between precision and recall. In some use cases for natural language processing, there are reasons to prefer to tilt that trade-off toward high precision. Relying on the Zipfian distribution of false positive results, we describe a strategy for increasing precision, using a variety of both pre-processing and post-processing methods. They draw on both knowledge-based and frequentist approaches to modeling language. Based on an existing high-performance biomedical concept recognition pipeline and a previously published manually annotated corpus, we apply this hybrid rationalist/empiricist strategy to concept normalization for eight different ontologies. Which approaches did and did not improve precision varied widely between the ontologies.
