Shayantan Banerjee scite author profile

A complicated clinical course for critically ill patients admitted to the intensive care unit (ICU) usually includes multiorgan dysfunction and subsequent death. Owing to the heterogeneity, complexity, and unpredictability of the disease progression, ICU patient care is challenging. Identifying the predictors of complicated courses and subsequent mortality at the early stages of the disease and recognizing the trajectory of the disease from the vast array of longitudinal quantitative clinical data is difficult. Therefore, we attempted to perform a meta-analysis of previously published gene expression datasets to identify novel early biomarkers and train the artificial intelligence systems to recognize the disease trajectories and subsequent clinical outcomes. Using the gene expression profile of peripheral blood cells obtained within 24 h of pediatric ICU (PICU) admission and numerous clinical data from 228 septic patients from pediatric ICU, we identified 20 differentially expressed genes predictive of complicated course outcomes and developed a new machine learning model. After 5-fold cross-validation with 10 iterations, the overall mean area under the curve reached 0.82. Using a subset of the same set of genes, we further achieved an overall area under the curve of 0.72, 0.96, 0.83, and 0.82, respectively, on four independent external validation sets. This model was highly effective in identifying the clinical trajectories of the patients and mortality. Artificial intelligence systems identified eight out of twenty novel genetic markers (SDC4, CLEC5A, TCN1, MS4A3, HCAR3, OLAH, PLCB1, and NLRP1) that help predict sepsis severity or mortality. While these genes have been previously associated with sepsis mortality, in this work, we show that these genes are also implicated in complex disease courses, even among survivors. 
The discovery of eight novel genetic biomarkers related to the overactive innate immune system, including neutrophil function, together with a new predictive machine learning method, provides options to effectively recognize sepsis trajectories, modify real-time treatment options, and improve prognosis and patient survival.
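
The evaluation protocol mentioned above (5-fold cross-validation with 10 iterations, summarized as a mean area under the curve) can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the nearest-centroid scorer `fit_centroid`, the helper names, and the seed are all assumptions introduced here, and any real model could be substituted for the `fit` argument.

```python
import random
from statistics import mean

def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with roughly equal class balance."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def repeated_cv_auc(X, y, fit, k=5, repeats=10, seed=0):
    """Mean AUC over `repeats` rounds of stratified k-fold cross-validation.

    Assumes every fold contains both classes (true when each class has
    at least k samples).
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            scores = [predict(X[i]) for i in test_idx]
            aucs.append(auc([y[i] for i in test_idx], scores))
    return mean(aucs)

def fit_centroid(X_train, y_train):
    """Hypothetical toy model: higher score means closer to the class-1 centroid."""
    def centroid(cls):
        rows = [x for x, y in zip(X_train, y_train) if y == cls]
        return [mean(col) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return lambda x: dist(x, c0) - dist(x, c1)
```

Calling `repeated_cv_auc(X, y, fit_centroid)` with a list of feature rows `X` and binary labels `y` returns the mean of the 50 per-fold AUC values (5 folds times 10 repeats).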

Sequence Neighborhoods Enable Reliable Prediction of Pathogenic Mutations in Cancer Genomes

Banerjee

Raman

Ravindran

2021

Cancers

Identifying cancer-causing mutations from sequenced cancer genomes holds much promise for targeted therapy and precision medicine. “Driver” mutations are primarily responsible for cancer progression, while “passengers” are functionally neutral. Although several computational approaches have been developed for distinguishing between driver and passenger mutations, very few have concentrated on using the raw nucleotide sequences surrounding a particular mutation as potential features for building predictive models. Using experimentally validated cancer mutation data in this study, we explored various string-based feature representation techniques to incorporate information on the neighborhood bases immediately 5′ and 3′ from each mutated position. Density estimation methods showed significant distributional differences between the neighborhood bases surrounding driver and passenger mutations. Binary classification models derived using repeated cross-validation experiments provided comparable performances across all window sizes. Integrating sequence features derived from raw nucleotide sequences with other genomic, structural, and evolutionary features resulted in the development of a pan-cancer mutation effect prediction tool, NBDriver, which was highly efficient in identifying pathogenic variants from five independent validation datasets. An ensemble predictor obtained by combining the predictions from NBDriver with three other commonly used driver prediction tools (FATHMM (cancer), CONDEL, and MutationTaster) significantly outperformed existing pan-cancer models in prioritizing a literature-curated list of driver and passenger mutations. Using the list of true positive mutation predictions derived from NBDriver, we identified a list of 138 known driver genes with functional evidence from various sources.
Overall, our study underscores the efficacy of using raw nucleotide sequences as features to distinguish between driver and passenger mutations from sequenced cancer genomes.
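
To make the idea of neighborhood-based sequence features concrete, the sketch below extracts the bases immediately 5′ and 3′ of a mutated position and turns them into numeric vectors. It is only an illustration under stated assumptions: the abstract does not say which string-based representations were used, so the k-mer counts and one-hot encoding here, along with all function names, are hypothetical choices rather than NBDriver's actual feature set.

```python
from collections import Counter
from itertools import product

def neighborhood(sequence, pos, window):
    """Return the `window` bases 5' and 3' of the 0-based position `pos`."""
    if pos - window < 0 or pos + window >= len(sequence):
        raise ValueError("window extends past the sequence ends")
    upstream = sequence[pos - window:pos]
    downstream = sequence[pos + 1:pos + 1 + window]
    return upstream, downstream

def kmer_counts(text, k):
    """Count overlapping k-mers, in a fixed order over the ACGT alphabet."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return [counts[kmer] for kmer in vocab]

def one_hot(text):
    """Encode each base as four 0/1 indicators in A, C, G, T order."""
    return [int(base == ref) for base in text for ref in "ACGT"]

# Example: a window of 2 bases on each side of position 4.
up, down = neighborhood("ACGTACGTAC", 4, 2)
features = kmer_counts(up + down, 2) + one_hot(up + down)
```

Varying `window` yields one feature set per window size, which is the kind of comparison the abstract describes when it reports comparable performance across window sizes.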

Machine Learning Identifies Complicated Sepsis Course and Subsequent Mortality Based on 20 Genes in Peripheral Blood Immune Cells at 24 Hours post ICU admission

Banerjee

Mohammed

Wong

et al. 2020

Preprint

A complicated clinical course for critically ill patients admitted to the ICU usually includes multiorgan dysfunction and subsequent death. Owing to the heterogeneity, complexity, and unpredictability of the disease progression, patient care is challenging. Identifying the predictors of complicated courses and subsequent mortality at the early stages of the disease and recognizing the trajectory of the disease from the vast array of longitudinal quantitative clinical data is difficult. Therefore, we attempted to identify novel early biomarkers and train the artificial intelligence systems to recognize the disease trajectories and subsequent clinical outcomes. Using the gene expression profile of peripheral blood cells obtained within 24 hours of PICU admission and numerous clinical data from 228 septic patients from pediatric ICU, we identified 20 differentially expressed genes that were predictive of complicated course outcomes and developed a new machine learning model. After 5-fold cross-validation with ten iterations, the overall mean area under the curve reached 0.82. This model was highly effective in identifying the clinical trajectories of the patients and mortality. Artificial intelligence systems identified eight out of twenty novel genetic markers (SDC4, CLEC5A, TCN1, MS4A3, HCAR3, OLAH, PLCB1, and NLRP1) that help to predict sepsis severity or mortality. The discovery of eight novel genetic biomarkers related to the overactive innate immune system and neutrophil function, together with a new predictive machine learning method, provides options to effectively recognize sepsis trajectories, modify real-time treatment options, and improve prognosis and patient survival.

Sequence neighborhoods enable reliable prediction of pathogenic mutations in cancer genomes

Banerjee

Raman

Ravindran

2021

Preprint

Identifying cancer-causing mutations from sequenced cancer genomes holds much promise for targeted therapy and precision medicine. "Driver" mutations are primarily responsible for cancer progression, while "passengers" are functionally neutral. Although several computational approaches have been developed for distinguishing between driver and passenger mutations, very few have concentrated on utilizing the raw nucleotide sequences surrounding a particular mutation as potential features for building predictive models. Using experimentally validated cancer mutation data in this study, we explored various string-based feature representation techniques to incorporate information on the neighborhood bases immediately 5' and 3' from each mutated position. Density estimation methods showed significant distributional differences between the neighborhood bases surrounding driver and passenger mutations. Binary classification models derived using repeated cross-validation experiments gave comparable performances across all window sizes. Integrating sequence features derived from raw nucleotide sequences with other genomic, structural, and evolutionary features resulted in the development of a pan-cancer mutation effect prediction tool, NBDriver, which was highly efficient in identifying pathogenic variants from five independent validation datasets. An ensemble predictor obtained by combining the predictions from NBDriver with two other commonly used driver prediction tools (CONDEL and MutationTaster) outperformed existing pan-cancer models in prioritizing a literature-curated list of driver and passenger mutations. Using the list of true positive mutation predictions derived from NBDriver, we identified a list of 138 known driver genes with functional evidence from various sources. Overall, our study underscores the efficacy of utilizing raw nucleotide sequences as features to distinguish between driver and passenger mutations from sequenced cancer genomes.

Machine learning driven identification of gene-expression signatures correlated with multiple organ dysfunction trajectories and complex sub-endotypes of pediatric septic shock

Atreya

Banerjee

Lautz

et al. 2022

Preprint

Background: Multiple organ dysfunction syndrome (MODS) disproportionately drives sepsis morbidity and mortality among children. The biology of this heterogeneous syndrome is complex, dynamic, and incompletely understood. Gene expression signatures correlated with MODS trajectories may facilitate identification of molecular targets and predictive enrichment. Methods: Secondary analyses of publicly available datasets. (1) Supervised machine learning (ML) was used to identify genes correlated with persistent MODS relative to those without in the derivation cohort. Model performances were tested across 4 validation cohorts, among children and adults with differing inciting cause for organ dysfunctions, to identify a stable set of genes and fixed classification model to reliably estimate the risk of MODS. Clinical propensity scores, where available, were used to enhance model performance. (2) We identified organ-specific dysfunction signatures by eliminating redundancies between the shared MODS signature and those of individual organ dysfunctions. (3) Finally, novel patient subclasses were identified through unsupervised hierarchical clustering of genes correlated with persistent MODS and compared with previously established pediatric septic shock endotypes. Results: 568 genes were differentially expressed, among which ML identified 109 genes that were consistently correlated with persistent MODS. The AUROC of a model that incorporated the stable features chosen from repeated cross-validation experiments to estimate risk of MODS was 0.87 (95% CI: 0.85–0.88). Model performance using the top 20 genes and an ExtraTree classification model yielded AUROCs ranging from 0.77 to 0.96 among validation cohorts. Genes correlated with day 3 and 7 cardiovascular, respiratory, and renal dysfunctions were identified. Finally, the top 50 genes were used to discover four novel subclasses, of which patients belonging to M1 and M2 had the worst clinical outcomes.
Reactome pathway analyses revealed a potential role of transcription factor RUNX1 in distinguishing subclasses. Interaction with receipt of adjuvant steroids suggested that newly derived M1 and M2 endotypes were biologically distinct relative to established endotypes. Conclusions: Our data suggest the existence of complex sub-endotypes among children with septic shock wherein overlapping biological pathways may be linked to differential response to therapies. Future studies in cohorts enriched for patients with MODS may facilitate discovery and development of disease modifying therapies for subsets of critically ill children with sepsis.

iCOMIC: a graphical interface-driven bioinformatics pipeline for analyzing cancer omics data

Sithara

Maripuri

Moorthy

et al. 2021

Preprint

Despite the tremendous increase in omics data generated by modern sequencing technologies, their analysis can be tricky and often requires substantial expertise in bioinformatics. To address this concern, we have developed a user-friendly pipeline to analyze (cancer) genomic data that takes in raw sequencing data (FASTQ format) as input and outputs insightful statistics on the nature of the data. Our iCOMIC toolkit pipeline can analyze whole-genome and transcriptome data and is embedded in the popular Snakemake workflow management system. iCOMIC is characterized by a user-friendly GUI that offers several advantages, including executing analyses in minimal steps and eliminating the need for complex command-line arguments. The toolkit features many independent core workflows for both whole genomic and transcriptomic data analysis. Even though all the necessary, well-established tools are integrated into the pipeline to enable "out-of-the-box" analysis, we provide the user with the means to replace modules or alter the pipeline as needed. Notably, we have integrated algorithms developed in-house for predicting driver and passenger mutations based on mutational context and tumor suppressor genes and oncogenes from somatic mutation data. We benchmarked our tool against the Genome In A Bottle (GIAB) benchmark dataset (NA12878) and obtained the highest F1 scores of 0.971 and 0.988 for indels and SNPs, respectively, using the BWA MEM - GATK HC DNA-Seq pipeline. Similarly, we achieved a correlation coefficient of r=0.85 using the HISAT2-StringTie-ballgown and STAR-StringTie-ballgown RNA-Seq pipelines on the human monocyte dataset (SRP082682). Overall, our tool enables easy analyses of omics datasets, with minimal steps, significantly ameliorating complex data analysis pipelines. Availability: https://github.com/RamanLab/iCOMIC

iCOMIC: a graphical interface-driven bioinformatics pipeline for analyzing cancer omics data

Sithara

Maripuri

Moorthy

et al. 2022

Despite the tremendous increase in omics data generated by modern sequencing technologies, their analysis can be tricky and often requires substantial expertise in bioinformatics. To address this concern, we have developed a user-friendly pipeline to analyze (cancer) genomic data that takes in raw sequencing data (FASTQ format) as input and outputs insightful statistics. Our iCOMIC toolkit pipeline featuring many independent workflows is embedded in the popular Snakemake workflow management system. It can analyze whole-genome and transcriptome data and is characterized by a user-friendly GUI that offers several advantages, including minimal execution steps and eliminating the need for complex command-line arguments. Notably, we have integrated algorithms developed in-house to predict pathogenicity among cancer-causing mutations and differentiate between tumor suppressor genes and oncogenes from somatic mutation data. We benchmarked our tool against the Genome In A Bottle benchmark dataset (NA12878) and obtained the highest F1 scores of 0.971 and 0.988 for indels and SNPs, respectively, using the BWA MEM - GATK HC DNA-Seq pipeline. Similarly, we achieved a correlation coefficient of r = 0.85 using the HISAT2-StringTie-ballgown and STAR-StringTie-ballgown RNA-Seq pipelines on the human monocyte dataset (SRP082682). Overall, our tool enables easy analyses of omics datasets, significantly ameliorating complex data analysis pipelines.
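
For readers unfamiliar with the two benchmark metrics quoted above, the following sketch shows how an F1 score and a Pearson correlation coefficient are computed in general. It is not taken from iCOMIC: the function names are hypothetical, and the inputs in the usage note are invented purely for illustration, not benchmark data.

```python
from math import sqrt

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from variant-call counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

In a variant-calling benchmark, `tp`, `fp`, and `fn` would be the counts of called variants that match, are absent from, or are missed relative to the truth set; for RNA-Seq, `xs` and `ys` would be paired expression estimates for the same genes.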
