Background: The biological research literature is a major repository of knowledge. As the amount of literature increases, it will get harder to find the information of interest on a particular topic. There has been an increasing amount of work on text mining this literature, but comparing this work is hard because of a lack of standards for making comparisons. To address this, we worked with colleagues at the Protein Design Group, CNB-CSIC, Madrid to develop BioCreAtIvE (Critical Assessment for Information Extraction in Biology), an open common evaluation of systems on a number of biological text mining tasks. We report here on task 1A, which deals with finding mentions of genes and related entities in text. "Finding mentions" is a basic task, which can be used as a building block for other text mining tasks. The task makes use of data and evaluation software provided by the (US) National Center for Biotechnology Information (NCBI).
Background: Our goal in BioCreAtIvE has been to assess the state of the art in text mining, with emphasis on tasks that reflect real biological applications, e.g., the curation process for model organism databases. This paper summarizes the BioCreAtIvE task 1B, the "Normalized Gene List" task, which was inspired by the gene list supplied for each curated paper in a model organism database. The task was to produce the correct list of unique gene identifiers for the genes and gene products mentioned in sets of abstracts from three model organisms (Yeast, Fly, and Mouse).
Most C. elegans sensory neuron types consist of a single bilateral pair of neurons, and respond to a unique set of sensory stimuli. Although genes required for the development and function of individual sensory neuron types have been identified in forward genetic screens, these approaches are unlikely to identify genes that when mutated result in subtle or pleiotropic phenotypes. Here, we describe a complementary approach to identify sensory neuron type-specific genes via microarray analysis using RNA from sorted AWB olfactory and AFD thermosensory neurons. The expression patterns of subsets of these genes were further verified in vivo. Genes identified by this analysis encode 7-transmembrane receptors, kinases, and nuclear factors including dac-1, which encodes a homolog of the highly conserved Dachshund protein. dac-1 is expressed in a subset of sensory neurons including the AFD neurons and is regulated by the TTX-1 OTX homeodomain protein. On thermal gradients, dac-1 mutants fail to suppress a cryophilic drive but continue to track isotherms at the cultivation temperature, representing the first genetic separation of these AFD-mediated behaviors. Expression profiling of single neuron types provides a rapid, powerful, and unbiased method for identifying neuron-specific genes whose functions can then be investigated in vivo.
Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision.
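The balanced F-measure figures quoted above follow from the stated precision and recall values, and the synonym-list approach can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' system: the lexicon, the identifiers (`FBgn_A`, `FBgn_B`, ...), the sample sentence, and the `min_length` and `drop_ambiguous` filters are all hypothetical stand-ins for the FlyBase resources and filters the abstract mentions.

```python
import re


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(predicted, gold):
    """Return (precision, recall, F) for a predicted set of gene identifiers."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical toy lexicon mapping lowercased synonyms to made-up identifiers.
# A real synonym list would be far larger and far more ambiguous.
SYNONYMS = {
    "white": {"FBgn_A"},
    "w": {"FBgn_A"},
    "hedgehog": {"FBgn_B"},
    "hh": {"FBgn_B"},
    "a": {"FBgn_C"},                # collides with a common English word
    "clot": {"FBgn_D", "FBgn_E"},   # one synonym shared by two genes
}


def normalize(text, lexicon, min_length=1, drop_ambiguous=False):
    """Map each token found in the lexicon to its gene identifier(s).

    min_length and drop_ambiguous are illustrative filters (assumptions):
    they discard very short synonyms and synonyms shared by several genes.
    """
    tokens = re.findall(r"[A-Za-z0-9-]+", text.lower())
    identifiers = set()
    for token in tokens:
        candidates = lexicon.get(token)
        if not candidates:
            continue
        if len(token) < min_length:
            continue
        if drop_ambiguous and len(candidates) > 1:
            continue
        identifiers |= candidates
    return identifiers


if __name__ == "__main__":
    # Reproduce the F-measures reported in the abstract from its P/R values.
    print(round(f_measure(0.50, 0.72), 2))  # filtered pattern matching
    print(round(f_measure(0.88, 0.61), 2))  # HMM tagger plus filters

    text = "A mutation in hedgehog (hh) alters a wing pattern."
    gold = {"FBgn_B"}
    print(score(normalize(text, SYNONYMS), gold))
    print(score(normalize(text, SYNONYMS, min_length=2), gold))
```

In this toy case the unfiltered lookup also matches the one-letter synonym "a", which lowers precision without changing recall; the length filter removes that false positive. This mirrors, on a very small scale, the trade-off the abstract describes between high-recall pattern matching and higher-precision filtered output.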
Neuronal identities are specified by the combinatorial functions of activators and repressors of gene expression. Members of the well-conserved Olf/EBF (O/E) transcription factor family have been shown to play important roles in neuronal and non-neuronal development and differentiation. O/E proteins are highly expressed in the olfactory epithelium, and O/E binding sites have been identified upstream of olfactory genes. However, the roles of O/E proteins in sensory neuron development are unclear. Here we show that the O/E protein UNC-3 is required for subtype specification of the ASI chemosensory neurons in Caenorhabditis elegans. UNC-3 promotes an ASI identity by directly repressing the expression of alternate neuronal programs and by activating expression of ASI-specific genes including the daf-7 TGF-beta gene. Our results indicate that UNC-3 is a critical component of the transcription factor code that integrates cell-intrinsic developmental programs with external signals to specify sensory neuronal identity and suggest models for O/E protein functions in other systems.