An abundance of biological data sources contain data on classes of scientific entities, such as genes and sequences. Logical relationships between scientific objects are implemented as URLs and foreign IDs. Query processing typically involves traversing links and paths (concatenations of links) through these sources. We model the data objects in these sources and the links between objects as an object graph. We identify a set of interesting properties for links and paths, such as the outdegree, the image of a link, the cardinality of data objects and links, and the number of distinct objects reached by a link. Analogous to database cost models, we use statistics from the object graph to develop a framework to estimate the result size for a query on the object graph. Analogous to training and testing, we use sampled data from queries to estimate the result size. We validate our models using data sampled from four NIH/NCBI data sources. Our research provides a foundation for querying and exploring data sources.
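As a rough illustration of the link properties named above, the following minimal Python sketch computes a few statistics for a single link between two sources. The identifiers, the `link_statistics` helper, and the exact definitions used (for instance, averaging outdegree only over objects that have at least one outgoing link) are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy link instances between two sources.
# All identifiers are invented for illustration.
links = [
    ("omim:1", "pubmed:a"),
    ("omim:1", "pubmed:b"),
    ("omim:2", "pubmed:b"),
    ("omim:3", "pubmed:c"),
]


def link_statistics(links, source_cardinality):
    """Compute simple statistics for one link of an object graph.

    links: iterable of (source_object, target_object) pairs.
    source_cardinality: total number of objects in the source,
        including objects with no outgoing link.
    """
    targets_by_source = defaultdict(set)
    for src, dst in links:
        targets_by_source[src].add(dst)

    # Number of distinct (source, target) pairs.
    link_cardinality = sum(len(t) for t in targets_by_source.values())
    # Number of source objects with at least one outgoing link.
    participating = len(targets_by_source)
    # Image of the link: distinct target objects reached.
    image = set().union(*targets_by_source.values()) if targets_by_source else set()

    return {
        "link_cardinality": link_cardinality,
        "participating_sources": participating,
        "image_size": len(image),
        "avg_outdegree": link_cardinality / participating if participating else 0.0,
        "participation_ratio": (
            participating / source_cardinality if source_cardinality else 0.0
        ),
    }


stats = link_statistics(links, source_cardinality=4)
```

Statistics of this kind are what a cost model could combine to estimate how many distinct objects a path reaches without materializing the path.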
An abundance of life sciences data sources contain data about scientific entities such as genes and sequences. Scientists are interested in exploring relationships between scientific objects, e.g., between genes and bibliographic citations. A scientist may choose the OMIM source, which contains information related to human genetic diseases, as a starting point for her exploration, and wish to eventually retrieve all related citations from the PUBMED source. Starting with a keyword search on a certain disease, she can explore all possible relationships between genes in OMIM and citations in PUBMED. This corresponds to the following query:
"Return all citations of PUBMED that are linked to an OMIM entry that is related to some disease or condition."
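A query of this shape can be sketched as a keyword selection followed by a link traversal. The Python example below is a toy version under stated assumptions: the entries, identifiers, and the `citations_for_keyword` function are hypothetical, the data is held in memory, and only direct links from OMIM to PUBMED are followed, whereas real exploration may also pass through intermediate sources.

```python
# Hypothetical OMIM-like entries: id -> descriptive text (invented data).
omim_entries = {
    "omim:1": "disease A, a hypothetical condition",
    "omim:2": "disease B",
    "omim:3": "disease A variant",
}

# Hypothetical direct links from entries to PUBMED-like citations.
omim_to_pubmed = {
    "omim:1": ["pubmed:a", "pubmed:b"],
    "omim:2": ["pubmed:d"],
    "omim:3": ["pubmed:b", "pubmed:c"],
}


def citations_for_keyword(keyword, entries, links):
    """Return the distinct citations linked to entries matching a keyword."""
    needle = keyword.lower()
    # Step 1: keyword search selects the starting objects.
    matched = [eid for eid, text in entries.items() if needle in text.lower()]
    # Step 2: follow the link from each matched entry and deduplicate,
    # since several entries may point to the same citation.
    reached = set()
    for eid in matched:
        reached.update(links.get(eid, ()))
    return sorted(reached)


result = citations_for_keyword("disease A", omim_entries, omim_to_pubmed)
```

Two matching entries contribute four link instances here but only three distinct citations, which is why the result size of such a query depends on the number of distinct objects reached and not only on the number of links.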