Reads Binning Improves Alignment-Free Metagenome Comparison

Song, Kai; Ren, Jie; Sun, Fengzhu

doi:10.3389/fgene.2019.01156

Cited by 16 publications

(13 citation statements)

References 61 publications

(88 reference statements)


“…The first group refers to GC content features, these features are classic ones in gene prediction. The second group refers to k-mer features, these features are widely used in other branches of Bioinformatics such as assembly [ 25 ] and binning [ 30 ], but still little explored in gene prediction problems.

Fig.…”

Section: Methods (mentioning)

confidence: 99%

geneRFinder: gene finding in distinct metagenomic data complexities

et al. 2021


Background: Microbes perform a fundamental economic, social, and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (the complex communities) and their interactions. The way they act is usually estimated by looking at the functions they play in those environments and their responsibility is measured by their genes. The advances of next-generation sequencing technology have facilitated metagenomics research; however, they also create a heavy computational burden. Large and complex biological datasets are available as never before. There are many gene predictors available that can aid the gene annotation process, though they do not appropriately handle metagenomic data complexities. There is no standard metagenomic benchmark data for gene prediction. Thus, gene predictors may inflate their results by obfuscating low false discovery rates.

Results: We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across this benchmark by using only one pre-trained Random Forest model. Average prediction rates of geneRFinder differed in percentage terms by 54% and 64%, respectively, against Prodigal and FragGeneScan while handling high complexity metagenomes. The specificity rate of geneRFinder had the largest distance against FragGeneScan, 79 percentage points, and 66 more than Prodigal. According to McNemar’s test, all percentage differences between predictors’ performances are statistically significant for all datasets with a 99% confidence interval.

Conclusions: We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at gitlab.com/r.lorenna/generfinder and https://osf.io/w2yd6/. We also provide a novel, comprehensive benchmark dataset for gene prediction, which is based on The Critical Assessment of Metagenome Interpretation (CAMI) challenge and contains labeled data from gene regions, available at https://sourceforge.net/p/generfinder-benchmark.


“…Short k-mer (k < 15) based measures, such as […] and CVtree, calculate dissimilarity between sequences or high-throughput sequencing samples (Jiang et al, 2012; Liao et al, 2016; Song et al, 2019) using the global statistical models. Based on long k-mers (k > 21), Mash (Ondov et al, 2016), Skmer (Sarmashghi et al, 2019), and Kmer-db (Deorowicz et al, 2018) use MinHash to approximate Jaccard distance between pairwise sequences based on randomly sampled small set of k-mers.…”

Section: Methods (mentioning)

confidence: 99%

KmerGO: A Tool to Identify Group-Specific Sequences With k-mers

et al. 2020

Self Cite


Capturing group-specific sequences between two groups of genomic/metagenomic sequences is critical for the follow-up identification of single nucleotide variants (SNVs), gene families, microbial species or other elements associated with each group. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered a “group-specific” sequence in our study. We developed a user-friendly tool, KmerGO, to identify group-specific sequences between two groups of genomic/metagenomic long sequences or high-throughput sequencing datasets. Compared with other tools, KmerGO captures group-specific k-mers (k up to 40 bp) with much lower requirements for computing resources in much shorter running time. For a 1.05 TB dataset (.fasta), it takes KmerGO about 21.5 h (including k-mer counting) to return assembled group-specific sequences on a regular stand-alone workstation with no more than 1 GB memory. Furthermore, KmerGO can also be applied to capture trait-associated sequences for continuous traits. Using multi-process parallel computing, KmerGO is implemented with both a graphical user interface and a command line on Linux and Windows, free from any pre-installed supporting environments, packages, and complex configurations. The output group-specific k-mers or sequences from KmerGO could be the inputs of other tools for the downstream discovery of biomarkers, such as genetic variants, species, or genes. KmerGO is available at https://github.com/ChnMasterOG/KmerGO.

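The citation statement above notes that tools such as Mash, Skmer, and Kmer-db use MinHash to approximate the Jaccard distance from a small sampled set of k-mers. The following is a minimal sketch of that idea using a bottom-s sketch; it is not the implementation of any of those tools. The function names, the choice of blake2b as hash, and the parameters `k` and `s` are assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Stable 64-bit hash of a k-mer (blake2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest distinct k-mer hash values."""
    return sorted({hash_kmer(x) for x in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches.

    The s smallest hashes of the merged sketches form a sample of the
    union; the fraction of them present in both sketches is the estimate.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def jaccard_exact(seq_a, seq_b, k):
    """Exact Jaccard index of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    a, b = "ACGTACGTGGTTAACC", "ACGTACGTCCTTAACC"
    k, s = 4, 8
    est = jaccard_estimate(sketch(a, k, s), sketch(b, k, s), s)
    print("estimate:", est, "exact:", jaccard_exact(a, b, k))
    print("Jaccard distance estimate:", 1.0 - est)
```

When `s` is at least the number of distinct k-mers in both sequences, the estimate coincides with the exact Jaccard index (barring hash collisions); smaller `s` trades accuracy for memory.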

“…The first group refers to GC content features, these features are classic ones in gene prediction, being used in tools such as FragGeneScan, Prodigal and Orphelia. The second group refers to k-mer features, these features are widely used in other branches of Bioinformatics such as assembly [30] and binning [31], but still little explored in gene prediction problems. The feature importance index was calculated according to importance method of the Caret package [32] and, as Figure 3 presents the sequence length as the most important one, followed by k-mer features, having more than 80% importance index.…”

Section: Feature Engineering (mentioning)

confidence: 99%

geneRFinder: gene finding in distinct metagenomic data complexities

Silva, Padovani, Góes et al. 2020

Preprint


Motivation: Microbes perform a fundamental economic, social and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (the complex communities) and their interactions. The way they act is usually estimated by looking at the functions they play in those environments and their responsibility is measured by their genes. The advances of next-generation sequencing technology have facilitated metagenomics research; however, they also create a heavy computational burden. Large and complex biological datasets are available as never before. There are many gene predictors available that can aid the gene annotation process, though they do not appropriately handle metagenomic data complexities. There is no standard metagenomic benchmark data for gene prediction. Thus, gene predictors may inflate their results by obfuscating low false discovery rates.

Results: We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across this benchmark by using only one pre-trained Random Forest model. Average prediction rates of geneRFinder differed in percentage terms by 54% and 64%, respectively, against Prodigal and FragGeneScan while handling high complexity metagenomes. The specificity rate of geneRFinder had the largest distance against FragGeneScan, 79 percentage points, and 66 more than Prodigal. According to McNemar’s test, all percentage differences between predictors’ performances are statistically significant for all datasets with a 99% confidence interval.

Conclusions: We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at github.com/railorena/geneRFinder. We also provide a novel, comprehensive benchmark dataset for gene prediction, which is based on The Critical Assessment of Metagenome Interpretation (CAMI) challenge and contains labeled data from gene regions, available at sourceforge.net/p/generfinder-benchmark.

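The feature-engineering statement above names three kinds of sequence features: sequence length, GC content, and k-mer features. A minimal sketch of how such a feature vector could be computed is given below. It is not geneRFinder's actual feature set or code; the function names, the default `k=2`, and the use of normalized k-mer frequencies are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer over ACGT, in lexicographic order.

    k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in all_kmers)
    return [counts[m] / total if total else 0.0 for m in all_kmers]


def feature_vector(seq, k=2):
    """Hypothetical feature vector: [length, GC content, 4**k k-mer frequencies]."""
    return [len(seq), gc_content(seq)] + kmer_frequencies(seq, k)


if __name__ == "__main__":
    features = feature_vector("ATGCGCGTTAACGGA", k=2)
    print(len(features), features[:2])
```

A vector of this shape could be passed to any tabular classifier, such as a Random Forest; the number of k-mer features grows as 4**k, so small values of `k` keep the vector manageable.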
