Motivation High-throughput sequencing data can be affected by different technical errors, e.g. from probe preparation or false base calling. As a consequence, reproducibility of experiments can be weakened. In virus metagenomics, technical errors can result in falsely identified viruses in samples from infected hosts. We present a new resampling approach based on bootstrap sampling of sequencing reads from FASTQ-files in order to generate artificial replicates of sequencing runs which can help to judge the robustness of an analysis. In addition, we evaluate a mixture model on the distribution of read counts per virus to identify potentially false positive findings. Results The evaluation of our approach on an artificially generated data set with known viral sequence content shows in general a high reproducibility of uncovering viruses in sequencing data. I.e., the correlation between original and mean bootstrap read count was highly correlated. However, the bootstrap read counts can also indicate reduced or increased evidence for the presence of a virus in the biological sample. We also found that the mixture model fits well to the read counts, and furthermore, it provides a higher accuracy on the original or on the bootstrap read counts than on the difference between both. The usefulness of our methods is further demonstrated on two freely available real world data sets from harbour seals. Availability We provide a Phyton tool, called RESEQ, available from https://github.com/babaksaremi/RESEQ that allows efficient generation of bootstrap reads from an original FASTQ-file. Contact klaus.jung@tiho-hannover.de Supplementary information Supplementary data are available at Bioinformatics online.
To better understand the molecular basis of respiratory diseases of viral origin, high-throughput gene-expression data are frequently taken by means of DNA microarray or RNA-seq technology. Such data can also be useful to classify infected individuals by molecular signatures in the form of machine-learning models with genes as predictor variables. Early diagnosis of patients by molecular signatures could also contribute to better treatments. An approach that has rarely been considered for machine-learning models in the context of transcriptomics is data augmentation. For other data types it has been shown that augmentation can improve classification accuracy and prevent overfitting. Here, we compare three strategies for data augmentation of DNA microarray and RNA-seq data from two selected studies on respiratory diseases of viral origin. The first study involves samples of patients with either viral or bacterial origin of the respiratory disease, the second study involves patients with either SARS-CoV-2 or another respiratory virus as disease origin. Specifically, we reanalyze these public datasets to study whether patient classification by transcriptomic signatures can be improved when adding artificial data for training of the machine-learning models. Our comparison reveals that augmentation of transcriptomic data can improve the classification accuracy and that fewer genes are necessary as explanatory variables in the final models. We also report genes from our signatures that overlap with signatures presented in the original publications of our example data. Due to strict selection criteria, the molecular role of these genes in the context of respiratory infectious diseases is underlined.
Due to the overall high costs, technical replicates are usually omitted in RNA-seq experiments, but several methods exist to generate them artificially. Bootstrapping reads from FASTQ-files has recently been used in the context of other NGS analyses and can be used to generate artificial technical replicates. Bootstrapping samples from the columns of the expression matrix has already been used for DNA microarray data and generates a new artificial replicate of the whole experiment. Mixing data of individual samples has been used for data augmentation in machine learning. The aim of this comparison is to evaluate which of these strategies are best suited to study the reproducibility of differential expression and gene-set enrichment analysis in an RNA-seq experiment. To study the approaches under controlled conditions, we performed a new RNA-seq experiment on gene expression changes upon virus infection compared to untreated control samples. In order to compare the approaches for artificial replicates, each of the samples was sequenced twice, i.e. as true technical replicates, and differential expression analysis and GO term enrichment analysis was conducted separately for the two resulting data sets. Although we observed a high correlation between the results from the two replicates, there are still many genes and GO terms that would be selected from one replicate but not from the other. Cluster analyses showed that artificial replicates generated by bootstrapping reads produce it p values and fold changes that are close to those obtained from the true data sets. Results generated from artificial replicates with the approaches of column bootstrap or mixing observations were less similar to the results from the true replicates. Furthermore, the overlap of results among replicates generated by column bootstrap or mixing observations was much stronger than among the true replicates. Artificial technical replicates generated by bootstrapping sequencing reads from FASTQ-files are better suited to study the reproducibility of results from differential expression and GO term enrichment analysis in RNA-seq experiments than column bootstrap or mixing observations. However, FASTQ-bootstrapping is computationally more expensive than the other two approaches. The FASTQ-bootstrapping may be applicable to other applications of high-throughput sequencing.
Background Gene expression profiling is a widely adopted method in areas like drug development or functional gene analysis. Microarray data of gene expression experiments is still commonly used and widely available for retrospective analyses. However, due to to changes of the underlying technologies data sets from different technologies are often difficult to compare and thus a multitude of already available data becomes difficult to use. We present a web application that abstracts away mathematical and programmatical details in order to enable a convenient and customizable analysis of microarray data for large-scale reproducibility studies. In addition, the web application provides a feature that allows easy access to large microarray repositories. Results Our web application consists of three basic steps which are necessary for a differential gene expression analysis as well as Gene Ontology (GO) enrichment analysis and the comparison of multiple analysis results. Genealyzer can handle Affymetrix data as well as one-channel and two-channel Agilent data. All steps are visualized with meaningful plots. The application offers flexible analysis while being intuitively operable. Conclusions Our web application provides a unified platform for analysing microarray data, while allowing users to compare the results of different technologies and organisms. Beyond reproducibility, this also offers many possibilities for gaining further insights from existing study data, especially since data from different technologies or organisms can also be compared. The web application can be accessed via this URL: https://genealyzer.item.fraunhofer.de/. Login credentials can be found at the end.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.