Recapitulation and Retrospective Prediction of Biomedical Associations Using Temporally-enabled Word Embeddings

Park, Jiho; Marquez, Agustin Lopez; Puranik, Arjun; Rajasekharan, Ajit; Aravamudan, Murali; Garcia-Rivera, Enrique

doi:10.1101/627513

Cited by 9 publications

(13 citation statements)

References 14 publications

“…In order to capture biomedical literature-based associations, the nferX platform defines two scores: a ‘local score’ and a ‘global score’, as described previously ( Park et al, 2020 ). Briefly, the local score is obtained from applying a traditional natural language processing technique which captures the strength of association between two concepts in a selected corpus of biomedical literature based on the frequency of their co-occurrence normalized by the frequency of each individual concept throughout the corpus.…”

Section: Methods (mentioning)

confidence: 99%
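
A co-occurrence frequency normalized by the frequency of each individual concept is commonly computed as pointwise mutual information (PMI). The sketch below illustrates that idea only; it is not the nferX implementation. The function name `local_scores`, the choice of whole-document co-occurrence, and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def local_scores(documents):
    """Return a PMI-style score for every concept pair that co-occurs.

    `documents` is an iterable of collections of concept tokens.
    Assumption: two concepts "co-occur" when they appear in the same document.
    """
    docs = [set(d) for d in documents]
    n_docs = len(docs)
    single = Counter()  # number of documents containing each concept
    pair = Counter()    # number of documents containing both concepts
    for d in docs:
        single.update(d)
        pair.update(combinations(sorted(d), 2))

    scores = {}
    for (a, b), n_ab in pair.items():
        p_ab = n_ab / n_docs
        p_a = single[a] / n_docs
        p_b = single[b] / n_docs
        # PMI: log of the observed co-occurrence rate over the rate
        # expected if the two concepts were independent.
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy corpus: each document is reduced to its set of concepts.
toy_docs = [
    {"ace2", "sars-cov-2"},
    {"ace2", "sars-cov-2"},
    {"ace2", "lung"},
    {"egfr", "kidney"},
]
scores = local_scores(toy_docs)
```

Pairs are keyed as alphabetically sorted tuples, so `scores[("egfr", "kidney")]` is defined while a pair that never co-occurs, such as `("ace2", "kidney")`, is absent.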

“…Note: One key drawback of the word2vec vector cosine similarity ( Park et al, 2020 ; Mikolov et al, 2013b ) method is its inability to get scores for logical queries as described above, because the method ( Mikolov et al, 2013b ) does not address the question of how to get vectors for queries that are logical combinations of tokens.…”

Section: Methods (mentioning)

confidence: 99%
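
The limitation described in this statement can be seen in a minimal sketch of cosine similarity over per-token vectors. The three-dimensional vectors and the names `toy_vectors` and `similarity` are hypothetical; real word2vec embeddings are learned from a corpus and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: exactly one vector per token, nothing else.
toy_vectors = {
    "egfr": [0.9, 0.1, 0.3],
    "kidney": [0.2, 0.8, 0.1],
    "cancer": [0.7, 0.0, 0.6],
}


def similarity(token_a, token_b, vectors=toy_vectors):
    """Score a pair of single tokens; raises KeyError for anything else."""
    return cosine_similarity(vectors[token_a], vectors[token_b])


egfr_cancer = similarity("egfr", "cancer")
# A logical query such as "egfr AND kidney" has no vector of its own,
# so this lookup-based scheme cannot score it without a further rule.
has_logical_vector = "egfr AND kidney" in toy_vectors
```

The lookup table holds one vector per token, which is why a conjunction of tokens needs either a separate combination rule or a different scoring method.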

“…We use sets of known pairs of related entities versus a "control" group of random pairs of entities of the same classes. We use a few different sets of known pairs:
(1) Disease-Gene relationships based on OMIM 58
(2) Drug-Gene relationships
(3) Drug-Disease relationships based on FDA labels
(a) Drugs and their on-label indications
(b) Drugs and their on-label adverse events
(4) Logical queries for ambiguous tokens
One demonstration of the use of the logical query system is to disambiguate a token by conjoining it with a disambiguating token. An example is clearer: the token "egfr" can refer to the gene entity epidermal growth factor receptor, but also the test measure entity estimated glomerular filtration rate.…”

Section: Evaluation Of Literature-derived Association Scores Ground T (mentioning)

confidence: 99%

“…We used an internal set of ~200-300 such ("A AND B", "C") pairs (originally built up for other reasons). Note: One key drawback of the word2vec vector cosine similarity 55,59 method is its inability to get scores for logical queries as described above, because the method 59 does not address the question of how to get vectors for queries that are logical combinations of tokens.

Evaluation metrics: Given a scoring method and a particular set of positive/control pairs, we get two sets of scores: one set for the positive pairs and one set for the negative pairs.…”

Section: Evaluation Of Literature-derived Association Scores Ground T (mentioning)

confidence: 99%
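
The statement above is cut off before it names a metric. One standard way to compare a set of positive-pair scores against a set of control-pair scores is the area under the ROC curve (AUC), which equals the probability that a randomly chosen positive pair outscores a randomly chosen control pair. The sketch below assumes that metric purely for illustration; the function name `pairwise_auc` and the score values are hypothetical.

```python
def pairwise_auc(positive_scores, control_scores):
    """AUC as the fraction of (positive, control) pairs ranked correctly.

    Ties count as half a win. This is an assumed metric for illustration;
    the quoted source does not say which metric it uses.
    """
    if not positive_scores or not control_scores:
        raise ValueError("both score sets must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for c in control_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(positive_scores) * len(control_scores))


# Hypothetical scores for known (positive) pairs and random (control) pairs.
positive = [0.9, 0.8, 0.4]
control = [0.5, 0.3, 0.1]
auc = pairwise_auc(positive, control)  # 8 of the 9 pairs are ranked correctly
```

A value near 1.0 means the scoring method separates known pairs from random pairs well, and a value near 0.5 means it does no better than chance.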

Knowledge synthesis of 100 million biomedical documents augments the deep expression profiling of coronavirus receptors

Venkatakrishnan, Puranik, Anand, et al. 2020. eLife. Self-citation.

The COVID-19 pandemic demands assimilation of all biomedical knowledge to decode mechanisms of pathogenesis. Despite the recent renaissance in neural networks, a platform for the real-time synthesis of the exponentially growing biomedical literature and deep omics insights is unavailable. Here, we present the nferX platform for dynamic inference from over 45 quadrillion possible conceptual associations from unstructured text, and triangulation with insights from single-cell RNA-sequencing, bulk RNA-seq and proteomics from diverse tissue types. A hypothesis-free profiling of ACE2 suggests tongue keratinocytes, olfactory epithelial cells, airway club cells and respiratory ciliated cells as potential reservoirs of the SARS-CoV-2 receptor. We find the gut as the putative hotspot of COVID-19, where a maturation correlated transcriptional signature is shared in small intestine enterocytes among coronavirus receptors (ACE2, DPP4, ANPEP). A holistic data science platform triangulating insights from structured and unstructured data holds potential for accelerating the generation of impactful biological insights and hypotheses.

“…In order to capture biomedical literature based associations, the nferX platform defines two scores: a "local score" and a "global score", as described previously 51 . Briefly, the local score represents a traditional natural language processing technique which captures the strength of association between two concepts in a selected corpus of biomedical literature based on the frequency of their co-occurrence normalized by the frequency of each individual concept throughout the corpus.…”

Section: Unstructured Biomedical Knowledge Synthesis and Triangulatio (mentioning)

confidence: 99%

Knowledge synthesis from 100 million biomedical documents augments the deep expression profiling of coronavirus receptors

Venkatakrishnan, Puranik, Anand, et al. 2020. Preprint. Self-citation.

The COVID-19 pandemic demands assimilation of all available biomedical knowledge to decode its mechanisms of pathogenicity and transmission. Despite the recent renaissance in unsupervised neural networks for decoding unstructured natural languages, a platform for the real-time synthesis of the exponentially growing biomedical literature and its comprehensive triangulation with deep omic insights is not available. Here, we present the nferX platform for dynamic inference from over 45 quadrillion possible conceptual associations extracted from unstructured biomedical text, and their triangulation with Single Cell RNA-sequencing based insights from over 25 tissues. Using this platform, we identify intersections between the pathologic manifestations of COVID-19 and the comprehensive expression profile of the SARS-CoV-2 receptor ACE2. We find that tongue keratinocytes, airway club cells, and ciliated cells are likely underappreciated targets of SARS-CoV-2 infection, in addition to type II pneumocytes and olfactory epithelial cells. We further identify mature small intestinal enterocytes as a possible hotspot of COVID-19 fecal-oral transmission, where an intriguing maturation-correlated transcriptional signature is shared between ACE2 and the other coronavirus receptors DPP4 (MERS-CoV) and ANPEP (ɑ-coronavirus). This study demonstrates how a holistic data science platform can leverage unprecedented quantities of structured and unstructured publicly available data to accelerate the generation of impactful biological insights and hypotheses.

The nferX Platform Single-cell resource: https://academia.nferx.com/

Predicting cross-tissue hormone–gene relations using balanced word embeddings

et al. 2022

Motivation: Inter-organ/inter-tissue communication is central to multi-cellular organisms including humans, and mapping inter-tissue interactions can advance system-level whole-body modeling efforts. Large volumes of biomedical literature have fostered studies that map within-tissue or tissue-agnostic interactions, but literature mining studies that infer inter-tissue relations such as between hormones and genes are solely missing.

Results: We present a first study to predict from biomedical literature the hormone-gene associations mediating inter-tissue signaling in the human body. Our BioEmbedS* models use neural network based Biomedical word Embeddings with a Support Vector Machine classifier to predict if a hormone-gene pair is associated or not, and whether an associated gene is involved in the hormone's production or response. Model training relies on our unified dataset HGv1 (Hormone-Gene version 1) of ground-truth associations between genes and endocrine hormones, which we compiled and carefully balanced in the embedded space to handle data disparities such as between poorly- vs. well-studied hormones. Our BioEmbedS model recapitulates known gene mediators of tissue-tissue signaling with 70.4% accuracy; predicts novel inter-tissue communication genes in humans which are enriched for hormone-related disorders; and generalizes well to mouse, thereby holding promise for its extension to other multi-cellular organisms as well.

Availability: Freely available at https://cross-tissue-signaling.herokuapp.com are our model predictions & datasets; https://github.com/BIRDSgroup/BioEmbedS has all relevant code.

Supplemental Information: Supplementary information available at Bioinformatics online.
