Existing annotation paradigms rely on controlled vocabularies, where each data instance is classified into one term from a predefined controlled vocabulary. This paradigm restricts the analysis to concepts that are known and well-characterized. Here, we present BioTranslator, a novel multilingual translation method that addresses this problem. BioTranslator takes a user-written textual description of a new concept and then translates this description to a non-text biological data instance. The key idea of BioTranslator is to develop a multilingual translation framework, where multiple modalities of biological data are all translated to text. We demonstrate how BioTranslator enables the identification of novel cell types using only a textual description and how BioTranslator can be further generalized to protein function prediction and drug target identification. Our tool frees scientists from limiting their analyses to predefined controlled vocabularies, enabling them to interact with biological data using free text.
Motivation: The exponential growth of genomic sequencing data has created ever-expanding repositories of gene networks. Unsupervised network integration methods are critical to learn informative representations for each gene, which are later used as features for downstream applications. However, these network integration methods must be scalable to account for the increasing number of networks and robust to an uneven distribution of network types within hundreds of gene networks. Results: To address these needs, we present Gemini, a novel network integration method that uses memory-efficient high-order pooling to represent and weight each network according to its uniqueness. Gemini then mitigates the uneven network distribution through mixing up existing networks to create many new networks. We find that Gemini leads to more than a 10% improvement in F1 score, 15% improvement in micro-AUPRC, and 63% improvement in macro-AUPRC for human protein function prediction by integrating hundreds of networks from BioGRID, and that Gemini’s performance significantly improves when more networks are added to the input network collection, while Mashup and BIONIC embeddings’ performance deteriorates. Gemini thereby enables memory-efficient and informative network integration for large gene networks and can be used to massively integrate and analyze networks in other domains. Availability and implementation: Gemini can be accessed at: https://github.com/MinxZ/Gemini.
The exponential growth of genomic sequencing data has created ever-expanding repositories of gene networks. Unsupervised network integration methods are critical to learn informative representations for each gene, which are later used as features for downstream applications. However, these network integration methods must be scalable to account for the increasing number of networks and robust to an uneven distribution of network types within hundreds of gene networks. To address these needs, we present Gemini, a novel network integration method that uses memory-efficient high-order pooling to represent and weight each network according to its uniqueness. Gemini then mitigates the uneven distribution through mixing up existing networks to create many new networks. We find that Gemini leads to more than a 10% improvement in F1 score, 14% improvement in micro-AUPRC, and 71% improvement in macro-AUPRC for protein function prediction by integrating hundreds of networks from BioGRID, and that Gemini's performance significantly improves when more networks are added to the input network collection, while the comparison approach's performance deteriorates. Gemini thereby enables memory-efficient and informative network integration for large gene networks, and can be used to massively integrate and analyze networks in other domains.
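The abstract states only that Gemini creates new networks by "mixing up" existing ones. As a rough illustration of that general idea, and not of Gemini's actual implementation, the sketch below forms each synthetic network as a convex combination of two randomly chosen adjacency matrices. The function name `mix_networks`, the Beta-distributed mixing weight, and the dense list-of-lists representation are all assumptions made for this example.

```python
import random


def mix_networks(networks, n_new, alpha=1.0, seed=0):
    """Return n_new synthetic networks (hypothetical helper, not Gemini's API).

    networks: list of adjacency matrices (lists of lists) with identical shape.
    Each synthetic network is lam * A + (1 - lam) * B for a random pair (A, B),
    with lam drawn from Beta(alpha, alpha). The Beta prior is an assumption
    borrowed from standard mixup, not something the abstract specifies.
    """
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_new):
        net_a, net_b = rng.sample(networks, 2)  # two distinct source networks
        lam = rng.betavariate(alpha, alpha)     # mixing weight in [0, 1]
        mixed.append([
            [lam * x + (1.0 - lam) * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(net_a, net_b)
        ])
    return mixed


# Toy usage: three small networks over the same three genes.
toy_networks = [
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0], [1, 0, 0]],
    [[0, 1, 1], [1, 0, 1], [1, 1, 0]],
]
augmented = mix_networks(toy_networks, n_new=5)
```

Because every synthetic network is a convex combination, edge weights stay within the range of the inputs and symmetric inputs yield symmetric outputs.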
Drug combination therapy is a promising solution to many complicated diseases. Since experimental measurements cannot be scaled to millions of candidate combinations, many computational approaches have been developed to identify synergistic drug combinations. While most of the existing approaches use either SMILES-based features or molecular-graph-based features to represent drugs, we found that neither of these two feature modalities can comprehensively characterize a pair of drugs, necessitating the integration of these two types of features. Here, we propose Pisces, a cross-modal contrastive learning approach for synergistic drug combination prediction. The key idea of our approach is to model the combination of SMILES and molecular graphs as four views of a pair of drugs, and then apply contrastive learning to embed these four views closely to obtain high-quality drug pair embeddings. We evaluated Pisces on a recently released GDSC-Combo dataset, including 102,893 drug combinations and 125 cell lines. Pisces outperformed five existing drug combination prediction approaches under three settings, including vanilla cross validation, stratified cross validation for drug combinations, and stratified cross validation for cell lines. Our case study and ablation studies further confirmed the effectiveness of our novel contrastive learning framework and the importance of integrating the SMILES-based features and the molecular-graph-based features. Pisces obtains state-of-the-art results on drug synergy prediction and can potentially be used to model other drug-pair applications, such as drug-drug interaction. Availability: Implementation of Pisces and comparison approaches can be accessed at https://github.com/linjc16/Pisces.
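To make the "four views" idea concrete, the sketch below enumerates the four modality combinations for a drug pair and scores them with a generic InfoNCE-style contrastive loss. This is an illustration under stated assumptions, not the Pisces implementation: the helper names `four_views` and `info_nce`, the use of concatenation to form a pair embedding, the cosine similarity, and the temperature value are all hypothetical choices that the abstract does not specify.

```python
import math
from itertools import product

MODALITIES = ("smiles", "graph")


def four_views(drug_a, drug_b):
    """Enumerate the four modality combinations for a drug pair.

    drug_a, drug_b: dicts mapping "smiles" and "graph" to embedding lists.
    Concatenating the two embeddings is an assumption made for illustration.
    """
    return {
        (mod_a, mod_b): drug_a[mod_a] + drug_b[mod_b]
        for mod_a, mod_b in product(MODALITIES, repeat=2)
    }


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: low when anchor is close to positive
    and far from every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy usage with made-up 2-dimensional embeddings.
drug_a = {"smiles": [1.0, 0.0], "graph": [0.9, 0.1]}
drug_b = {"smiles": [0.0, 1.0], "graph": [0.1, 0.9]}
views = four_views(drug_a, drug_b)

# Pull two views of the same pair together; push away a view of another pair.
other_pair_view = [0.0, 1.0, 1.0, 0.0]
loss = info_nce(
    views[("smiles", "smiles")],
    views[("graph", "graph")],
    negatives=[other_pair_view],
)
```

In this toy setup the views of one drug pair act as positives for each other, and views of other pairs serve as negatives.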
Understanding the temporal dynamics of gene expression is crucial for developmental biology, tumor biology, and biogerontology. However, some time points remain challenging to measure in the lab, particularly during very early or very late stages in a biological process. Here we propose Sagittarius, a transformer-based model that is able to accurately simulate gene expression profiles at time points outside of the range of times measured in the lab. The key idea behind Sagittarius is to learn a shared reference space that generates simulated time series measurements, thereby explicitly modeling unaligned time points and conditional batch effects between time series and making the model widely applicable to diverse biological settings. We show the promising performance of Sagittarius when extrapolating mammalian developmental gene expression, simulating drug-induced expression at unmeasured dose and treatment times, and augmenting datasets to accurately predict drug sensitivity. We also used Sagittarius to simulate mutation profiles for early-stage cancer patients, which further enabled us to discover a gene set related to the Hedgehog signaling pathway that may be related to tumorigenesis in sarcoma patients, including PTCH1, ARID2, and MYCBP2. By augmenting experimental temporal datasets with crucial but difficult-to-measure simulated datapoints, Sagittarius enables deeper insights into the temporal dynamics of heterogeneous transcriptomic processes and can be broadly applied to biological time series extrapolation.