2020
DOI: 10.1038/s41598-020-64053-w

Photosynthetic protein classification using genome neighborhood-based machine learning feature

Abstract: Identification of novel photosynthetic proteins is important for understanding and improving photosynthetic efficiency. Synergistically, genome neighborhood can provide additional useful information to identify photosynthetic proteins. We, therefore, expected that applying a computational approach, particularly machine learning (ML) with the genome neighborhood-based feature should facilitate the photosynthetic function assignment. Our results revealed a functional relationship between photosynthetic genes and…

Cited by 9 publications (11 citation statements)
References 54 publications
“…The random forest analysis yielded a machine learning classifier that labeled genomes as CBB-positive or CBB-negative with 72.9% accuracy using ECs and 76.5% accuracy using Pfams, compared to the expected accuracy of 50% if picking labels at random. For comparison, it has been shown that a random forest can achieve 88% accuracy in predicting photosynthetic proteins based on gene neighborhood [49], a tree-based classifier can achieve 86% accuracy (94% with a k-nearest neighbor model) in predicting the recombination status of HIV genomes [50], and a support vector machine can achieve 87% accuracy in classifying bacteria as pathogenic or not based on their proteomes [51]. The accuracy achieved in the present study was likely limited by false negative genomes, genomes in the process of adapting to recent loss or gain of the Calvin cycle, a low amount of training data, and limited ability of the random forest to focus on relevant aspects of the data.…”
Section: Results
Citation type: mentioning
confidence: 99%
“…15,191 protein sequences consisting of at least one of 61 photosynthesis-specific GO terms identified by Ashkenazi et al. [12] were included in the dataset. To avoid incomplete gene neighborhood identification, only photosynthetic proteins from 154 photosynthetic prokaryotes with complete genomes were included in the dataset, as previously reported [15]. To reduce sequence redundancy, we used the USEARCH analytical tool [22] to cluster similar sequences (sequence identity ≤ 50% as a diverse dataset and ≤ 70% as an easy dataset), using the command: where the input file is in FASTA format, identity is the percent sequence identity cutoff for the cluster, and output is the selected representative sequence in FASTA format.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…Two gene clusters were merged into the same neighborhood gene cluster if they were in a range of 200-1000 bp in the divergent direction, following the operon interaction concept as demonstrated in S1 Fig. The homologous relationship between protein sequences was determined by the protein clustering method with three stringent criteria: 1E-10, 1E-50, and 1E-100, according to a previous study [15]. The genome neighborhood conservation scores (Phylo scores) were calculated based on the phylogenetic tree of organisms that conserve those gene neighborhoods, as described previously [15]. The quantile cutoff points were determined and used for converting the raw Phylo scores to simple numeric forms (i.e., 0, 1, 2, and 3).…”
Section: Photomodgo Development
Citation type: mentioning
confidence: 99%
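
The last step quoted above, converting raw Phylo scores to the values 0, 1, 2, and 3 with quantile cutoff points, can be illustrated with a minimal sketch. This is not the citing authors' code. It assumes the three cut points are quartiles of the observed scores, which the excerpt does not state, and the names `quantile_cutoffs` and `discretize` are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_cutoffs(scores):
    """Return three cut points splitting the scores into four groups.

    Assumption: quartiles are used. The excerpt only says "quantile
    cutoff points" and does not name the exact quantiles.
    """
    return quantiles(scores, n=4, method="inclusive")


def discretize(score, cutoffs):
    """Map a raw score to 0, 1, 2, or 3.

    The result is the number of cut points less than or equal to the
    score, so scores below the first cut point map to 0 and scores at
    or above the last cut point map to 3.
    """
    return bisect_right(cutoffs, score)


# Illustrative raw scores (made up for this sketch, not from the paper).
raw_scores = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cutoffs = quantile_cutoffs(raw_scores)
levels = [discretize(s, cutoffs) for s in raw_scores]
print(cutoffs, levels)
```

Binning in this way trades resolution for robustness: a classifier sees four ordered levels rather than raw scores whose scale depends on the underlying phylogenetic tree.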