BackgroundHigh-throughput sequencing technology and bioinformatics have identified chimeric RNAs (chRNAs), raising the possibility of chRNAs expressing particularly in diseases can be used as potential biomarkers in both diagnosis and prognosis.ResultsThe task of discriminating true chRNAs from the false ones poses an interesting Machine Learning (ML) challenge. First of all, the sequencing data may contain false reads due to technical artifacts and during the analysis process, bioinformatics tools may generate false positives due to methodological biases. Moreover, if we succeed to have a proper set of observations (enough sequencing data) about true chRNAs, chances are that the devised model can not be able to generalize beyond it. Like any other machine learning problem, the first big issue is finding the good data to build models. As far as we were concerned, there is no common benchmark data available for chRNAs detection. The definition of a classification baseline is lacking in the related literature too. In this work we are moving towards benchmark data and an evaluation of the fidelity of supervised classifiers in the prediction of chRNAs.ConclusionsWe proposed a modelization strategy that can be used to increase the tools performances in context of chRNA classification based on a simulated data generator, that permit to continuously integrate new complex chimeric events. The pipeline incorporated a genome mutation process and simulated RNA-seq data. The reads within distinct depth were aligned and analysed by CRAC that integrates genomic location and local coverage, allowing biological predictions at the read scale. Additionally, these reads were functionally annotated and aggregated to form chRNAs events, making it possible to evaluate ML methods (classifiers) performance in both levels of reads and events. Ensemble learning strategies demonstrated to be more robust to this classification problem, providing an average AUC performance of 95 % (ACC=94 %, Kappa=0.87 %). The resulting classification models were also tested on real RNA-seq data from a set of twenty-seven patients with acute myeloid leukemia (AML).Electronic supplementary materialThe online version of this article (doi:10.1186/s13040-016-0112-6) contains supplementary material, which is available to authorized users.
BackgroundThe evolution of next-generation sequencing (NGS) technologies has led to increased focus on RNA-Seq. Many bioinformatic tools have been developed for RNA-Seq analysis, each with unique performance characteristics and configuration parameters. Users face an increasingly complex task in understanding which bioinformatic tools are best for their specific needs and how they should be configured. In order to provide some answers to these questions, we investigate the performance of leading bioinformatic tools designed for RNA-Seq analysis and propose a methodology for systematic evaluation and comparison of performance to help users make well informed choices.ResultsTo evaluate RNA-Seq pipelines, we developed a suite of two benchmarking tools. SimCT generates simulated datasets that get as close as possible to specific real biological conditions accompanied by the list of genomic incidents and mutations that have been inserted. BenchCT then compares the output of any bioinformatics pipeline that has been run against a SimCT dataset with the simulated genomic and transcriptional variations it contains to give an accurate performance evaluation in addressing specific biological question. We used these tools to simulate a real-world genomic medicine question s involving the comparison of healthy and cancerous cells. Results revealed that performance in addressing a particular biological context varied significantly depending on the choice of tools and settings used. We also found that by combining the output of certain pipelines, substantial performance improvements could be achieved.ConclusionOur research emphasizes the importance of selecting and configuring bioinformatic tools for the specific biological question being investigated to obtain optimal results. Pipeline designers, developers and users should include benchmarking in the context of their biological question as part of their design and quality control process. Our SimBA suite of benchmarking tools provides a reliable basis for comparing the performance of RNA-Seq bioinformatics pipelines in addressing a specific biological question. We would like to see the creation of a reference corpus of data-sets that would allow accurate comparison between benchmarks performed by different groups and the publication of more benchmarks based on this public corpus. SimBA software and data-set are available at http://cractools.gforge.inria.fr/softwares/simba/.Electronic supplementary materialThe online version of this article (doi:10.1186/s12859-017-1831-5) contains supplementary material, which is available to authorized users.
High-throughput next generation sequencing (NGS)
Background: High-throughput next generation sequencing (NGS) technologies enable the detection of biomarkers used for tumor classification, disease monitoring and cancer therapy. Whole-transcriptome analysis using RNA-seq is important, not only as a means of understanding the mechanisms responsible for complex diseases but also to efficiently identify novel genes/exons, splice isoforms, RNA editing, allele-specific mutations, differential gene expression and fusion-transcripts or chimeric RNA (chRNA). Methods: We used Crac, a tool that uses genomic locations and local coverage to classify biological events and directly infer splice and chimeric junctions within a single read. Crac’s algorithm extracts transcriptional chimeric events irrespective of annotation with a high sensitivity, and CracTools was used to aggregate, annotate and filter the chRNA reads. The selected chRNA candidates were validated by real time PCR and sequencing. In order to check the tumor specific expression of chRNA, we analyzed a publicly available dataset using a new tag search approach. Results: We present data related to acute myeloid leukemia (AML) RNA-seq analysis. We highlight novel biological cases of chRNA, in addition to previously well characterized leukemia chRNA. We have identified and validated 17 chRNAs among 3 AML patients: 10 from an AML patient with a translocation between chromosomes 15 and 17 (AML-t(15;17), 4 from patient with normal karyotype (AML-NK) 3 from a patient with chromosomal 16 inversion (AML-inv16). The new fusion transcripts can be classified into four groups according to the exon organization. Conclusions: All groups suggest complex but distinct synthesis mechanisms involving either collinear exons of different genes, non-collinear exons, or exons of different chromosomes. Finally, we check tumor-specific expression in a larger RNA-seq AML cohort and identify new AML biomarkers that could improve diagnosis and prognosis of AML.
Precision medicine relies on recognizing clinically relevant groups of patients but is currently weakened by the lack of consistency in phenotyping. Inspired by medical inductive reasoning, we developed a machine-learning system which models symptom interactions in genetic diseases to overcome this phenotyping heterogeneity. Based on the Human Phenotype Ontology and using medical data from raw text, we determine 1,1 million associated pairs of symptoms in diseases. From 16,600 phenotypic traits, we defined 390 groups of interacting symptoms summarizing the overall clinical spectrum. The symptom interaction model allowed to cover 99.8% of given symptom-gene relationships from a cohort of 1,686 diagnosed patients. This method should enable new discoveries in precision medicine by standardizing clinical descriptions.
Observing phenotyping practices from an international cohort of 1,686 cases revealed heterogeneity of phenotype reporting among clinicians. Heterogeneity limited their exploitation for diagnosis as only 43% of symptom-gene associations in the cohort were available in public databases. We developed a symptom interaction model that summarized 16,600 terms into 390 groups of interacting symptoms and detected 3,222,053 novel symptom-gene associations. By learning phenotypic patterns in genetic diseases, symptom interaction modeling handled heterogeneity in phenotyping, to the extent of covering 99.8% of our cohort’s symptom-gene associations. Using these symptom interactions improved the diagnostic performance in gene prioritization by 42% (median rank 80 to 41) compared to the best algorithms. Symptom interaction modeling will provide new discoveries in precision medicine by standardizing clinical descriptions.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.