T cells play an essential role in the adaptive immune system by seeking out, binding, and destroying foreign antigens presented on the surface of diseased cells. An improved understanding of T-cell immunity will greatly aid the development of new cancer immunotherapies and vaccines for life-threatening pathogens. Central to the design of such targeted therapies are computational methods that predict which non-native epitopes will elicit a T-cell response; however, we currently lack accurate immunogenicity inference methods. Another challenge is accurately simulating immunogenic peptides for specific human leukocyte antigen (HLA) alleles, both for synthetic biology applications and to augment real training datasets. Here, we propose a beta-binomial distribution approach to derive epitope immunogenic potential from sequence alone. We systematically benchmarked five traditional machine learning models (ElasticNet, KNN, SVM, Random Forest, AdaBoost) and three deep learning models (CNN, ResNet, GNN) using three independent, previously validated immunogenic peptide collections (dengue virus, cancer neoantigen, and SARS-CoV-2). We chose the CNN as the best prediction model based on its adaptability to small and large datasets and its performance relative to existing methods. In addition to outperforming two widely used immunogenicity prediction algorithms, DeepHLApan and IEDB, DeepImmuno-CNN correctly predicts which residues are most important for T-cell antigen recognition. Our independent generative adversarial network (GAN) approach, DeepImmuno-GAN, was further able to accurately simulate immunogenic peptides with physicochemical properties and immunogenicity predictions similar to those of real antigens. We provide DeepImmuno-CNN as source code and an easy-to-use web interface.
Data Availability: DeepImmuno Python3 code is available at https://github.com/frankligy/DeepImmuno. The DeepImmuno web portal is available from https://deepimmuno.herokuapp.com. The data in this article are available on GitHub and in the supplementary materials.
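The beta-binomial idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each peptide comes with a count of tested subjects and a count of positive T-cell responses, and it places a Beta prior on the response rate. The function name, the uniform Beta(1, 1) prior, and the example counts are illustrative assumptions, not the authors' implementation or parameters.

```python
def immunogenic_potential(n_positive, n_tested, prior_a=1.0, prior_b=1.0):
    """Posterior mean response rate under a Beta(prior_a, prior_b) prior.

    With a binomial likelihood, the posterior is
    Beta(prior_a + n_positive, prior_b + n_tested - n_positive),
    whose mean is returned. The prior values are assumptions for illustration.
    """
    if n_tested < 0 or not 0 <= n_positive <= n_tested:
        raise ValueError("need 0 <= n_positive <= n_tested")
    return (n_positive + prior_a) / (n_tested + prior_a + prior_b)


# Two hypothetical peptides with the same raw response rate (100%)
# but very different amounts of supporting evidence.
few = immunogenic_potential(1, 1)      # one subject tested, one responded
many = immunogenic_potential(20, 20)   # twenty tested, twenty responded

print(f"1/1 responders   -> {few:.3f}")
print(f"20/20 responders -> {many:.3f}")
```

The point of the sketch is that a peptide tested once is pulled toward the prior mean, while a peptide with many consistent results keeps a score close to its observed rate.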
Decisively delineating cell identities from uni- and multimodal single-cell datasets is complicated by diverse modalities, clustering methods, and reference atlases. We describe scTriangulate, a computational framework to mix-and-match multiple clustering results, modalities, associated algorithms, and resolutions to achieve an optimal solution. Unlike ensemble approaches, which select the “consensus”, scTriangulate picks the most stable solution through coalitional iteration. When evaluated on diverse multimodal technologies, scTriangulate outperforms alternative approaches in identifying high-confidence cell populations and modality-specific subtypes. Unlike existing integration strategies that rely on modality-specific joint embedding or geometric graphs, scTriangulate makes no assumption about the distributions of raw underlying values. As a result, this approach can solve unprecedented integration challenges, including the ability to automate reference cell-atlas construction, resolve clonal architecture within molecularly defined cell populations, and subdivide clusters to discover splicing-defined disease subtypes. scTriangulate is a flexible strategy for unified integration of single-cell or multimodal clustering solutions, from nearly unlimited sources.
Cells and tissues respond to perturbations in multiple ways that can be sensitively reflected in the alterations of gene expression. Current approaches to finding and quantifying the effects of perturbations on cell-level responses over time disregard the temporal consistency of identifiable gene programs. To leverage the occurrence of these patterns for perturbation analyses, we developed CellDrift (https://github.com/KANG-BIOINFO/CellDrift), a generalized linear model-based functional data analysis method that is capable of identifying covarying temporal patterns of various cell types in response to perturbations. As compared to several other approaches, CellDrift demonstrated superior performance in the identification of temporally varied perturbation patterns and the ability to impute missing time points. We applied CellDrift to multiple longitudinal datasets, including COVID-19 disease progression and gastrointestinal tract development, and demonstrated its ability to identify specific gene programs associated with sequential biological processes, trajectories and outcomes.
In diseases such as cancer, the design of new therapeutic strategies requires extensive, costly, and unfortunately sometimes deadly testing to reveal life-threatening off-target effects. A crucial first step in predicting toxicity is the analysis of normal RNA and protein tissue expression, which is now possible using comprehensive molecular tissue atlases. However, no standardized approaches exist for target prioritization, which instead relies on ad hoc thresholds and manual inspection. Such issues are compounded by the often problematic detection sensitivity and accuracy of genomic and proteomic data. Thus, quantifiable probabilistic scores for tumor specificity that address these challenges could enable the creation of new predictive models for combinatorial drug design and correlative analyses. Here, we propose a Bayesian Tumor Specificity (BayesTS) score that can naturally account for multiple independent forms of molecular evidence deriving from both RNA-Seq and protein expression while preserving the uncertainty of the inference. We applied BayesTS to 24,905 human protein-coding genes across 3,644 normal samples (GTEx and TCGA) spanning 63 tissues. These analyses demonstrate the ability of BayesTS to accurately incorporate protein, RNA, and tissue distribution evidence while effectively capturing the uncertainty of these inferences. This approach prioritized well-established drug targets while deemphasizing those that were later found to induce toxicity. BayesTS allows for the adjustment of tissue importance weights for tissues of interest, such as reproductive and physiologically dispensable tissues (e.g., tonsil, appendix), enabling clinically translatable prioritizations. Our results show that BayesTS can facilitate novel drug target discovery and can be easily generalized to unconventional molecular targets, such as splicing neoantigens.
We provide the code and inferred tumor specificity predictions as a database available online (https://github.com/frankligy/BayesTS). We envision that the widespread adoption of BayesTS will facilitate improved target prioritization for oncology drug development, ultimately leading to the discovery of more effective and safer drugs.
Hundreds of bioinformatics approaches now exist to define cellular heterogeneity from single-cell genomics data. Reconciling conflicts between diverse methods, algorithm settings, annotations, or modalities has the potential to clarify which populations are real and to establish reusable reference atlases. Here, we present a customizable computational strategy called scTriangulate, which leverages cooperative game theory to intelligently mix-and-match clustering solutions from different resolutions, algorithms, reference atlases, or multimodal measurements. This algorithm relies on a series of robust statistical metrics for cluster stability that work across molecular modalities to identify high-confidence integrated annotations. When applied to annotations from diverse competing cell atlas projects, this approach can resolve conflicts and determine the validity of controversial cell population predictions. Tested with scRNA-Seq, CITE-Seq (RNA + surface ADT), multiome (RNA + ATAC), and TEA-Seq (RNA + surface ADT + ATAC), this approach identifies highly stable and reproducible, known and novel cell populations, while excluding clusters defined by technical artifacts (i.e., doublets). Importantly, we find that distinct cell populations are frequently attributed to features from different modalities (RNA, ATAC, ADT) in the same assay, highlighting the importance of multimodal analysis in cluster determination. Because it is flexible, this approach can be updated with new user-defined statistical metrics to alter the decision engine and customized to new measures of stability for different measures of cellular activity.
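The cooperative game theory mentioned above is commonly formalized with the Shapley value, which credits each "player" with its average marginal contribution across all coalitions. The sketch below computes exact Shapley values for a toy game in which the players are three competing annotations. The annotation names and the coalition scores are hypothetical numbers chosen for illustration; this is not the scTriangulate API, and the abstract does not specify which stability metrics or value function the tool uses.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for a coalition value function.

    `value` maps a frozenset of players to a number. The cost grows
    exponentially with the number of players, so this suits small games only.
    """
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size forming before p joins.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical stability scores for every coalition of three annotations.
toy_scores = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.4,
    frozenset({"B"}): 0.5,
    frozenset({"C"}): 0.1,
    frozenset({"A", "B"}): 0.8,
    frozenset({"A", "C"}): 0.5,
    frozenset({"B", "C"}): 0.6,
    frozenset({"A", "B", "C"}): 0.9,
}

phi = shapley_values(["A", "B", "C"], toy_scores.__getitem__)
best = max(phi, key=phi.get)
print(phi, "-> most credited annotation:", best)
```

In this toy game the values sum to the score of the full coalition, so the credit is fully distributed among the annotations, and the annotation with the largest share would be the one preferred.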