Cathy N. Norton scite author profile

Background: A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information. Results: We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names including the discovery of new species names from text that will also handle misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9% and recall = 70.5%) compared to a popular dictionary based approach (precision = 97.5% and recall = 54.3%) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central’s full text articles annotated with scientific names, the precision and recall values are 98.5% and 96.2% respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages. Conclusions: We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org.
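The two-stage idea in the abstract (rule-based candidate generation, then classification on structural and contextual features) can be illustrated with a minimal sketch. The regular expression, the function names `candidate_names` and `structural_features`, and the two features shown are assumptions made for illustration; they are not NetiNeti's actual rules or feature set.

```python
import re

# Hypothetical candidate rule: a capitalized word followed by a lowercase
# word of at least three letters (the rough shape of a binomial name).
# This is an illustrative assumption, not NetiNeti's rule set.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")


def candidate_names(text):
    """Return every substring that has the shape of a binomial name."""
    return CANDIDATE.findall(text)


def structural_features(candidate):
    """Compute two illustrative structural features for one candidate."""
    genus, epithet = candidate.split(" ")
    return {
        "genus_length": len(genus),
        "epithet_suffix": epithet[-2:],
    }


text = "The bacterium Escherichia coli was isolated. These results are new."
candidates = candidate_names(text)
features = structural_features("Escherichia coli")
```

On this sentence the rule also returns "The bacterium" and "These results" alongside "Escherichia coli". That over-generation is why a second step is needed: a classifier using features of the candidate and of its context has to separate real names from look-alikes.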

uBioRSS: Tracking taxonomic literature using RSS

Leary, Remsen, Norton, et al. 2007

GenBank and PubMed: How connected are they?

2009

Background: GenBank® is a public repository of all publicly available molecular sequence data from a range of sources. In addition to relevant metadata (e.g., sequence description, source organism and taxonomy), publication information is recorded in the GenBank data file. The identification of literature associated with a given molecular sequence may be an essential first step in developing research hypotheses. Although many of the publications associated with GenBank records may not be linked into or part of complementary literature databases (e.g., PubMed), GenBank records associated with literature indexed in Medline are identifiable as they contain PubMed identifiers (PMIDs). Results: Here we show that an analysis of 87,116,501 GenBank sequence files reveals that 42% are associated with a publication or patent. Of these, 71% are associated with PMIDs, and can therefore be linked to a citation record in the PubMed database. The remaining 29% of publication-associated GenBank entries either do not have PMIDs or cite a publication that is not currently indexed by PubMed. We also identify the journal titles that are linked through citations in the GenBank files to the largest number of sequences. Conclusion: Our analysis suggests that GenBank contains molecular sequences from a range of disciplines beyond biomedicine, the initial scope of PubMed. The findings thus suggest opportunities to develop mechanisms for integrating biological knowledge beyond the biomedical field.
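The percentages in this abstract are nested (71% and 29% are shares of the 42%, not of all records), so a short calculation helps put them on a common scale. This is a sketch only: the abstract reports rounded percentages, so the derived counts are approximations, and the variable names are chosen for illustration.

```python
# Figures quoted in the abstract.
TOTAL_RECORDS = 87_116_501
SHARE_WITH_PUBLICATION = 0.42   # of all records
SHARE_WITH_PMID = 0.71          # of publication-associated records

# Approximate counts; the inputs are rounded, so these are estimates.
with_publication = TOTAL_RECORDS * SHARE_WITH_PUBLICATION
with_pmid = with_publication * SHARE_WITH_PMID
without_pmid = with_publication - with_pmid

# Share of ALL GenBank records that can be linked to PubMed.
pmid_share_of_all = SHARE_WITH_PUBLICATION * SHARE_WITH_PMID

print(f"with publication or patent: ~{with_publication / 1e6:.1f} million")
print(f"linked to PubMed via PMID:  ~{with_pmid / 1e6:.1f} million")
print(f"share of all records:       ~{pmid_share_of_all:.0%}")
```

Under these rounded inputs, roughly 36.6 million records cite a publication or patent, and about 30% of all GenBank records carry a PMID.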

Taxonomic Indexing—Extending the Role of Taxonomy

Patterson, Remsen, Marino, et al. 2006

Taxonomic indexing refers to a new array of taxonomically intelligent network services that use nomenclatural principles and elements of expert taxonomic knowledge to manage information about organisms. Taxonomic indexing was introduced to help manage the increasing amounts of digital information about biology. It has been designed to form a near basal layer in a layered cyberinfrastructure that deals with biological information. Taxonomic indexing accommodates the special problems of using names of organisms to index biological material. It links alternative names for the same entity (reconciliation), distinguishes between uses of the same name for different entities (disambiguation), and places names within an indefinite number of hierarchical schemes. In order to access all information on all organisms, taxonomic indexing must be able to call on a registry of all names in all forms for all organisms. NameBank has been developed to meet that need. Taxonomic indexing is an area of informatics that overlaps with taxonomy, is dependent on the expert input of taxonomists, and reveals the relevance of the discipline to a wide audience.
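Reconciliation and disambiguation, as described in this abstract, can be pictured as two lookups over a many-to-many mapping between names and entities. The sketch below is an assumption about how such an index might be modelled. The class `NameIndex`, its methods, and the entity labels are hypothetical, and nothing here reflects how NameBank is actually implemented.

```python
from collections import defaultdict


class NameIndex:
    """Hypothetical many-to-many index between names and entities."""

    def __init__(self):
        self._entities_by_name = defaultdict(set)
        self._names_by_entity = defaultdict(set)

    def add(self, name, entity_id):
        """Record that `name` has been used for `entity_id`."""
        self._entities_by_name[name].add(entity_id)
        self._names_by_entity[entity_id].add(name)

    def reconcile(self, name):
        """Return all names used for any entity that `name` refers to."""
        names = set()
        for entity_id in self._entities_by_name.get(name, ()):
            names |= self._names_by_entity[entity_id]
        return names

    def disambiguate(self, name):
        """Return the distinct entities that share `name`."""
        return set(self._entities_by_name.get(name, ()))


index = NameIndex()
# Two names for the same animal (entity labels are illustrative).
index.add("Puma concolor", "cougar")
index.add("Felis concolor", "cougar")
# One genus name used for both a plant and a bird.
index.add("Morus", "mulberry genus")
index.add("Morus", "gannet genus")

synonyms = index.reconcile("Felis concolor")
homonyms = index.disambiguate("Morus")
```

Here `reconcile` groups the two cougar names together, while `disambiguate` shows that "Morus" alone is not enough to identify an organism. Placement in hierarchical schemes, the third function the abstract mentions, is left out of this sketch.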

NetiNeti: discovery of scientific names from text using machine learning methods
