We have carried out automated extraction of explicit and implicit biomedical knowledge from publicly available gene and text databases to create a gene-to-gene co-citation network for 13,712 named human genes by automated analysis of titles and abstracts in over 10 million MEDLINE records. The associations between genes have been annotated by linking genes to terms from the medical subject heading (MeSH) index and terms from the gene ontology (GO) database. The extracted database and accompanying web tools for gene-expression analysis have collectively been named 'PubGene'. We validated the extracted networks by three large-scale experiments showing that co-occurrence reflects biologically meaningful relationships, thus providing an approach to extract and structure known biology. We validated the applicability of the tools by analyzing two publicly available microarray data sets.
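The co-occurrence idea described above can be illustrated with a minimal sketch. It is not the PubGene implementation: the function name `cooccurrence_network`, the simple tokenizer, and the toy records are assumptions made for illustration, and real gene-name recognition would need to handle synonyms and ambiguous symbols.

```python
from collections import Counter
from itertools import combinations
import re


def cooccurrence_network(abstracts, gene_symbols):
    """Count, for each unordered gene pair, how many records mention both genes.

    Hypothetical helper: matching is exact and case-sensitive on whole tokens.
    """
    symbols = set(gene_symbols)
    edge_counts = Counter()
    for text in abstracts:
        # Split the record into alphanumeric tokens and keep known gene symbols.
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        found = sorted(tokens & symbols)
        # Every pair of genes found in the same record gets one co-occurrence.
        for gene_a, gene_b in combinations(found, 2):
            edge_counts[(gene_a, gene_b)] += 1
    return edge_counts


# Toy records standing in for MEDLINE titles and abstracts.
records = [
    "TP53 and MDM2 interact in the DNA damage response.",
    "MDM2 regulates TP53 stability.",
    "BRCA1 mutations and TP53 status in breast tumours.",
]
network = cooccurrence_network(records, ["TP53", "MDM2", "BRCA1"])
for pair, count in sorted(network.items()):
    print(pair, count)
```

The edge weight here is simply the number of records in which both symbols appear; a threshold on that count is one plausible way to trade coverage against noise.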
In a living cell, the antiparallel double-stranded helix of DNA is a dynamically changing structure. The structure relates to interactions between and within the DNA strands, and to the array of other macromolecules that constitutes functional chromatin. It is only through its changing conformations that DNA can organize and structure a large number of cellular functions. In particular, DNA must locally uncoil, or melt, and become single-stranded for DNA replication, repair, recombination, and transcription to occur. It has previously been shown that this melting occurs cooperatively, whereby several base pairs act in concert to generate melting bubbles, and in this way constitute a domain that behaves as a unit with respect to local DNA single-strandedness. We have applied a melting map calculation to the complete human genome, which provides information about the propensities of forming local bubbles determined from the whole sequence, and present a first report on its basic features, the extent of cooperativity, and correlations to various physical and biological features of the human genome. Globally, the melting map covaries very strongly with GC content. Most importantly, however, cooperativity of DNA denaturation causes this correlation to be weaker at resolutions below 500 bp. This is also the resolution level at which most structural and biological processes occur, signifying the importance of the informational content inherent in the genomic melting map. The human DNA melting map may be further explored at http://meltmap.uio.no.
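The GC content that the melting map is compared against can be computed per window, as in the sketch below. This covers only the GC side of the comparison; the melting map itself requires a thermodynamic model that is not reproduced here. The function name `gc_content_windows` and the toy sequence are assumptions for illustration.

```python
def gc_content_windows(sequence, window):
    """Return the GC fraction of each non-overlapping window of the given size.

    Hypothetical helper: a trailing partial window is ignored.
    """
    seq = sequence.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        fractions.append(gc / window)
    return fractions


# Toy sequence; a real analysis would use windows of hundreds of base pairs.
toy = "GGCC" + "ATAT" + "GCAT"
print(gc_content_windows(toy, 4))  # [1.0, 0.0, 0.5]
```

Varying `window` is how one would probe different resolutions, for example above and below 500 bp.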
In molecular biology there is much interest in various types of relationships between genes. Due to the complexity and rapid development of this field, much of this knowledge exists only in free-text form. A database of relationships between genes may allow background knowledge to be used in computerised analyses. As far as we know, no comprehensive manually curated database of this kind exists, and constructing and maintaining such a database manually would be very labour-intensive. Efficient automated methods for extraction and structuring of relationships between genes from free text would be valuable. A database named PubGene has previously been created; it contains a comprehensive network of human genes created by automated extraction of co-occurrence of gene terms in over 10 million MEDLINE records. Co-occurring genes were linked together under the hypothesis that two genes will co-occur only if they have some biological relationship. In this paper, we show that for the subset of human genes encoding enzymes, pairs of co-occurring enzyme genes are significantly more closely related biologically than randomly paired genes from the same subset. Manual inspection, however, shows that some of the links in PubGene are not correct, and it also indicates how the noise can be reduced. We propose a complementary method for automated extraction of relationships between genes by use of information from the Science Citation Index (SCI) database. We relate two genes if they have been co-referred, that is, if reference articles for the two genes are cited together in a third article. The alternative approach confirms relationships found in PubGene, and it also finds other relevant relationships.
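The co-referral rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name `co_referred_pairs`, the dictionary layout, and the toy identifiers are hypothetical, and access to SCI data is not shown.

```python
from itertools import combinations


def co_referred_pairs(gene_refs, citing_articles):
    """Return gene pairs whose reference articles are cited together.

    gene_refs: gene -> set of reference article ids (assumed input layout).
    citing_articles: citing article id -> set of cited article ids.
    """
    # Invert the mapping so each reference article points to its genes.
    ref_to_genes = {}
    for gene, refs in gene_refs.items():
        for ref in refs:
            ref_to_genes.setdefault(ref, set()).add(gene)

    pairs = set()
    for cited in citing_articles.values():
        relevant = sorted(ref for ref in cited if ref in ref_to_genes)
        # Two distinct reference articles cited by the same third article.
        for ref_a, ref_b in combinations(relevant, 2):
            for gene_a in ref_to_genes[ref_a]:
                for gene_b in ref_to_genes[ref_b]:
                    if gene_a != gene_b:
                        pairs.add(tuple(sorted((gene_a, gene_b))))
    return pairs


# Toy data: paper_x cites the reference articles of GENE_A and GENE_B.
gene_refs = {"GENE_A": {"ref1"}, "GENE_B": {"ref2"}, "GENE_C": {"ref3"}}
citing = {"paper_x": {"ref1", "ref2"}, "paper_y": {"ref3"}}
linked = co_referred_pairs(gene_refs, citing)
print(linked)
```

Unlike co-occurrence in text, this rule never inspects gene names in the citing article, which is why it can surface relationships that a term-matching approach misses.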
A searchable Web interface, FigSearch, is accessible via http://pubgeneserver.uio.no/figsearch/ for all figures from the available corpus.
The advent of so-called cDNA microarrays has offered the first possibility to obtain a global understanding of biological processes in living organisms by simultaneous readouts of tens of thousands of genes. Initial experiments suggest that genes with similar function have similar expression patterns in microarray experiments. Until now, most approaches to computational analysis of gene expression have used unsupervised learning. Although in some cases unsupervised methods may be sufficient, the complexity of the biological processes is so high that it is unlikely that purely syntactical analyses are capable of fully exploiting the richness of the microarray data. In addition, it seems natural to re-use the existing biological (background) knowledge. In this paper, we present some elements of a methodology for knowledge discovery from microarray experiments. Two sources of biomedical knowledge are used: Ashburner's gene ontology and our own literature-derived network of gene-gene relations obtained by analysing Medline citation records. Predictive models can be induced, their classification quality validated through ROC/AUC analysis, and the models applied to provide hypotheses regarding the function of unclassified genes. The methodology has so far been tested on publicly available gene expression data, and its results have been evaluated by molecular biologists and medical researchers.
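The AUC used to validate such classifiers can be computed directly from its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below assumes binary labels (1 for genes annotated with a function, 0 otherwise) and hypothetical classifier scores; the function name `roc_auc` is an illustration, not part of the described methodology.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores for four genes.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise form is quadratic in the number of examples, which is fine for a sketch; sorting-based formulations give the same value more efficiently.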
Document collections resulting from searches in the biomedical literature, for instance, in PubMed, are often so large that some organization of the returned information is necessary. Clustering is an efficient tool for organizing search results. To help the user decide how to continue the search for relevant documents, the content of each cluster can be characterized by a set of representative keywords or cluster labels. As different users may have different interests, it is desirable to have solutions that can produce labels from a selection of different topical categories. We therefore introduce the concept of multi-focus cluster labeling to give users an overview of the contents through labels from multiple viewpoints. The concept of multi-focus cluster labeling has been established and demonstrated on three different document collections. We illustrate that multi-focus visualizations can give an overview of clusters along axes that general labels are not able to convey. The approach is generic and should be applicable to any biomedical (or other) domain with any selection of foci where appropriate focus vocabularies can be established. A user evaluation also indicates that such a multi-focus concept is useful.
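One way to picture multi-focus labeling is to rank, for each focus vocabulary, the vocabulary terms that occur in a cluster. The sketch below uses raw term frequency as the ranking criterion, which is an assumption; the abstract does not state how labels are scored. The function name `multi_focus_labels`, the foci, and the toy documents are likewise hypothetical.

```python
from collections import Counter
import re


def multi_focus_labels(cluster_docs, focus_vocabularies, top_k=2):
    """Pick up to top_k labels per focus for one cluster of documents.

    Assumed scoring: term frequency in the cluster, ties broken alphabetically.
    """
    tokens = []
    for doc in cluster_docs:
        tokens.extend(re.findall(r"[a-z0-9]+", doc.lower()))
    counts = Counter(tokens)

    labels = {}
    for focus, vocabulary in focus_vocabularies.items():
        present = [term for term in vocabulary if counts[term] > 0]
        ranked = sorted(present, key=lambda term: (-counts[term], term))
        labels[focus] = ranked[:top_k]
    return labels


# Toy cluster and two hypothetical focus vocabularies.
cluster = [
    "Insulin resistance in diabetes",
    "Metformin treatment of diabetes",
    "Insulin and metformin in obesity",
]
foci = {
    "disease": {"diabetes", "obesity", "asthma"},
    "chemical": {"insulin", "metformin", "aspirin"},
}
cluster_labels = multi_focus_labels(cluster, foci)
print(cluster_labels)
```

Each focus yields its own label list for the same cluster, so a user interested in diseases and one interested in chemicals see different summaries of identical documents.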