Background
High-throughput RNA sequencing is a powerful approach to studying gene expression. Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from other samples in the same treatment group, owing to either technical variation or true biological differences. The high dimensionality of the data, combined with few biological replicates, makes it challenging to detect such samples accurately, and this issue is not yet well studied in the literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging data points that deviate from that fit. Robust statistics has been widely used for outlier detection in multivariate data analysis in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis.
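The "fit the majority, then flag what deviates" idea can be illustrated with a minimal univariate sketch. This is not the rPCA procedure used in the study: it swaps in a simple median/MAD rule, and the function names, cutoffs, and example scores are all hypothetical. It also shows how a classical mean/standard-deviation rule can miss an outlier that inflates its own reference scale.

```python
from statistics import mean, median, stdev


def robust_flags(values, cutoff=3.5):
    """Flag indices whose median/MAD-based z-score exceeds the cutoff."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread in the bulk of the data; nothing to scale by
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    return [i for i, v in enumerate(values) if abs(v - center) / scale > cutoff]


def classical_flags(values, cutoff=3.0):
    """Flag indices using the mean and standard deviation (non-robust)."""
    center, scale = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - center) / scale > cutoff]


# Hypothetical per-sample scores; the last sample is far from the rest.
scores = [0.1, -0.3, 0.2, 0.0, -0.1, 8.5]
print(robust_flags(scores))     # [5]
print(classical_flags(scores))  # []
```

The classical rule flags nothing here because the extreme value pulls the mean toward itself and inflates the standard deviation, whereas the median and MAD are determined by the majority of the samples.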
Results
We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all the tests using positive control outliers with varying degrees of divergence. We applied the rPCA methods and classical principal component analysis (cPCA) to an RNA-seq data set profiling gene expression of the external granule layer in the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples, but cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal, as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the results as a reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed best in terms of detecting biologically relevant differentially expressed genes.
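For readers unfamiliar with how sensitivity and specificity apply to outlier calls, the following is a minimal sketch under stated assumptions. The function name and the sample labels are hypothetical and are not taken from the study; the sketch only shows how the two rates are computed when the true outliers (positive controls) are known.

```python
def sensitivity_specificity(all_samples, true_outliers, flagged):
    """Return (sensitivity, specificity) for a set of outlier calls.

    Sensitivity: fraction of true outliers that were flagged.
    Specificity: fraction of non-outlier samples that were not flagged.
    """
    true_outliers = set(true_outliers)
    flagged = set(flagged)
    normal = set(all_samples) - true_outliers
    true_pos = len(true_outliers & flagged)
    true_neg = len(normal - flagged)
    return true_pos / len(true_outliers), true_neg / len(normal)


# Hypothetical labels: eight samples, two of which are positive controls.
samples = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
controls = ["S3", "S7"]

print(sensitivity_specificity(samples, controls, ["S3", "S7"]))  # (1.0, 1.0)
```

A method that flags exactly the positive controls and nothing else scores 1.0 on both measures, which corresponds to the 100% sensitivity and 100% specificity reported above.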
Conclusions
rPCA implemented in the PcaGrid function is an accurate and objective method to detect outlier samples. It is well suited for high-dimensional data with small sample sizes like RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.
The cephalochordate amphioxus is a promising model animal for studying the evolutionary and developmental mechanisms of vertebrates because of its unique phylogenetic position, simple body plan, and sequenced genome. However, one major drawback of using amphioxus as a model organism is the restricted supply of living embryos, which are available only during the spawning season; this lasts from a couple of days to several months depending on the species. We therefore aim to develop methods for obtaining viable amphioxus embryos outside the spawning season. In the current study, we found that Branchiostoma belcheri could develop gonads and spawn consecutively in the laboratory when cultured at low density and high temperature (25–28°C) with sufficient food and clean conditions. Among the approximately 150 observed animals, which spawned spontaneously between November and December 2011, 10% had spawned twice, 10% three times, and 80% four times through April 2012. The quality and quantity of the gametes produced in consecutive spawnings showed no obvious difference from those of animals that spawned once naturally. Spawning intervals varied dramatically both among different animals (from 1 to 5 months) and between intervals of a single individual (from 27 to 74 days for one animal). In summary, we developed a method with which, for the first time, consecutive spawnings of amphioxus in captivity can be achieved. This has practical implications for the cultivation of other amphioxus species and should eventually promote the use of amphioxus as a model system.