Background: High-throughput transcriptome sequencing (RNA-seq) technology promises the discovery of novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that do not depend on prior gene annotations, genomic sequences or high-quality sequencing.
Results: We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs) in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. Tenfold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that the tool could achieve an accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets; PLEK attained >90% accuracy on most of these datasets. PLEK also performed well on a simulated dataset and on two real de novo assembled transcriptome datasets (sequenced on the PacBio and 454 platforms) with relatively high indel sequencing error rates. In addition, when run on a single thread, PLEK is approximately eightfold faster than a newly developed alignment-free tool, Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC).
Conclusions: PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/.
Electronic supplementary material: The online version of this article (doi:10.1186/1471-2105-15-311) contains supplementary material, which is available to authorized users.
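To make the idea of alignment-free k-mer features concrete, the following is a minimal sketch that turns a transcript into a fixed-length vector of sliding-window k-mer frequencies. The abstract does not describe what makes PLEK's scheme "improved", so this shows only plain k-mer frequencies; the function name `kmer_features` and the default `max_k=3` are illustrative assumptions, not PLEK's implementation or parameters.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_features(seq, max_k=3):
    """Return k-mer frequencies for k = 1..max_k in a fixed order.

    Hypothetical helper for illustration only; not taken from PLEK.
    """
    # Normalise case and treat RNA (U) like DNA (T).
    seq = seq.upper().replace("U", "T")
    features = []
    for k in range(1, max_k + 1):
        # Slide a window of width k along the sequence, one base at a time.
        windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        # Skip windows containing ambiguous bases such as N.
        windows = [w for w in windows if set(w) <= set(ALPHABET)]
        counts = Counter(windows)
        total = len(windows)
        # Enumerate all 4**k possible k-mers so every sequence
        # yields a vector of the same length and ordering.
        for kmer in map("".join, product(ALPHABET, repeat=k)):
            features.append(counts[kmer] / total if total else 0.0)
    return features


if __name__ == "__main__":
    vector = kmer_features("ACGUACGGA", max_k=2)
    print(len(vector))  # 4 one-mers + 16 two-mers
```

Vectors of this kind, computed for labelled mRNA and lncRNA transcripts, could then be passed to any SVM implementation for training. No genome or alignment is needed at any step, which is the property the abstract emphasises.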
Thousands of long intergenic noncoding RNAs (lincRNAs) have been identified in the human and mouse genomes, some of which play important roles in fundamental biological processes. The pig is an important domesticated animal; however, pig lincRNAs remain poorly characterized, and it is unknown whether they were involved in pig domestication. Here, we used available RNA-seq resources derived from 93 samples, together with expressed sequence tag data sets, and identified 6,621 lincRNA transcripts from 4,515 gene loci. Some of the identified lincRNA genes exhibit synteny and sequence conservation, including linc-sscg2561, whose neighboring gene Dnmt3a is associated with emotional behaviors. Both linc-sscg2561 and Dnmt3a show differential expression in the frontal cortex between domesticated pigs and wild boars, suggesting a possible role in pig domestication. This study provides the first comprehensive genome-wide analysis of pig lincRNAs.