PATtyFams: Protein Families for the Microbial Genomes in the PATRIC Database

Davis, James J.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Pusch, Gordon D.; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R.; Yoo, Hyunseung

doi:10.3389/fmicb.2016.00118

Cited by 172 publications

(136 citation statements)

References 47 publications

Supporting

Mentioning: 136

Contrasting


“…Last year, we developed a new algorithm for generating protein families that could be applied to the entire collection of PATRIC genomes (47). The method works by using the annotation process to guide family formation (4).…”

Section: What's New In Patric? (mentioning)

confidence: 99%

“…Overall, the PATtyFam algorithm is rapid and generates protein families resembling those created by alignment-based algorithms (47). When a genome is submitted to the PATRIC annotation service, protein families are automatically assigned to each protein by projection, thus enabling a user to compare their private genome with the PATRIC collection.…”

Section: What's New In Patric? (mentioning)

confidence: 99%


Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center

et al. 2016

Self Cite


The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by ‘virtual integration’ to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics.


“…In order to work with clean subsets of genomes, we chose to base analyses on the protein-encoding genes that are shared among members of the same species. We used the "PATtyFam" collection, which is a set of protein families that cover the ~230,000 publicly available genomes in the PATRIC database [40]. Protein similarity for building these families is based on the RAST signature k-mer collection [37], and all proteins must have the same annotation in order to be members of the same family.…”

Section: Core Conserved Gene Sets (mentioning)

confidence: 99%

“…In previous work, we observed that it is possible to build accurate AMR phenotype prediction models from whole genomes without using the AMR genes [21]. In this study, in order to explore the possibility of building models from limited genome sequence data, we chose to build models from core genes that are held in common among the members of a species, and which are not annotated as having a direct role in AMR [37,40]. By being nearly universally conserved, core genes are less likely to be horizontally transferred, and are also useful for assessing genome completeness and phylogeny.…”

Section: Amr Models Based On Core Genes Have Predictive Power (mentioning)

confidence: 99%
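
The notion of core genes "held in common among the members of a species", with AMR-annotated families set aside, can be illustrated as a presence count over protein family IDs. This is only a minimal sketch and not the cited authors' pipeline: the genome names, the family IDs, the `min_fraction` threshold, and the `excluded` set are all hypothetical placeholders.

```python
from collections import Counter


def core_families(genome_to_families, min_fraction=1.0, excluded=frozenset()):
    """Return family IDs found in at least `min_fraction` of the genomes.

    genome_to_families: dict mapping a genome ID to an iterable of family IDs.
    excluded: family IDs to drop (for example, families annotated as AMR-related).
    """
    n_genomes = len(genome_to_families)
    if n_genomes == 0:
        return set()

    # Count each family once per genome, regardless of copy number.
    presence = Counter()
    for families in genome_to_families.values():
        presence.update(set(families))

    return {
        family
        for family, count in presence.items()
        if count / n_genomes >= min_fraction and family not in excluded
    }


# Hypothetical toy data: three genomes and four made-up family IDs.
genomes = {
    "genome_1": ["FAM_A", "FAM_B", "FAM_C"],
    "genome_2": ["FAM_A", "FAM_B", "FAM_D"],
    "genome_3": ["FAM_A", "FAM_B", "FAM_C", "FAM_C"],
}
amr_families = {"FAM_B"}  # assumed to be flagged as AMR-related

strict_core = core_families(genomes, min_fraction=1.0, excluded=amr_families)
relaxed_core = core_families(genomes, min_fraction=0.6, excluded=amr_families)
```

Lowering `min_fraction` below 1.0 is one way to tolerate incomplete assemblies; whether and how the cited study relaxed the threshold is not stated in the excerpt above.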

Predicting Antimicrobial Resistance Using Conserved Genes

Nguyen, Olson, Shukla, et al. 2020

Preprint

Self Cite


Abbreviations: AI, artificial intelligence; CI, confidence interval; ME, major error (susceptible genomes predicted to be resistant); MIC, minimum inhibitory concentration; PATRIC, Pathosystems Resource Integration Center; PLF, PATRIC local protein family; RAST, Rapid Annotation using Subsystem Technology; RF, Random Forest; SR, susceptible and resistant; VME, very major error (resistant genomes predicted to be susceptible); XGB, XGBoost.

Abstract: A growing number of studies have shown that machine learning algorithms can be used to accurately predict antimicrobial resistance (AMR) phenotypes from bacterial sequence data. In these studies, models are typically trained using input features derived from comprehensive sets of known AMR genes or whole genome sequences. However, it can be difficult to determine whether genomes and their corresponding sets of AMR genes are complete when sequencing contaminated or metagenomic samples. In this study, we explore the possibility of using incomplete genome sequence data to predict AMR phenotypes. Machine learning models were built from randomly-selected sets of core genes that are held in common among the members of a species, and the AMR-conferring genes were removed based on their protein annotations. For Klebsiella pneumoniae, Mycobacterium tuberculosis, Salmonella enterica, and Staphylococcus aureus, we report that it is possible to classify susceptible and resistant phenotypes with average F1 scores ranging from 0.80-0.89 with as few as 100 conserved non-AMR genes, with very major error rates ranging from 0.11-0.23 and major error rates ranging from 0.10-0.20. Models built from core genes have predictive power in the cases where the primary AMR mechanism results from SNPs or horizontal gene transfer. By randomly sampling non-overlapping sets of core genes for use in these models, we show that F1 scores and error rates are stable and have little variance between replicates. Potential biases from strain-specific SNPs, phylogenetic sampling, and imbalances in the phylogenetic distribution of susceptible and resistant strains do not appear to have an impact on this result. Although these small core gene models have lower accuracies and higher error rates than models built from the corresponding assembled genomes, the results suggest that sufficient variation exists in the core non-AMR genes of a species for predicting AMR phenotypes. Overall, this study suggests that building models from conserved genes may be a potentially useful strategy for predicting AMR phenotypes when genomes are incomplete.
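
The F1 score and the two error rates named in this abstract can be computed from paired true and predicted labels, as in the sketch below. It rests on assumptions the abstract does not spell out: labels are the strings "R" (resistant) and "S" (susceptible), the resistant class is treated as the positive class for F1, the VME rate is divided by the number of truly resistant genomes, and the ME rate by the number of truly susceptible genomes. The function name and the toy labels are hypothetical.

```python
def amr_error_metrics(true_labels, predicted_labels):
    """Compute F1 (resistant = positive), VME rate, and ME rate.

    VME: resistant genome predicted susceptible.
    ME:  susceptible genome predicted resistant.
    Denominators are an assumption: VME / #resistant, ME / #susceptible.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == "R" and p == "R")
    vme = sum(1 for t, p in pairs if t == "R" and p == "S")
    me = sum(1 for t, p in pairs if t == "S" and p == "R")
    n_resistant = sum(1 for t, _ in pairs if t == "R")
    n_susceptible = sum(1 for t, _ in pairs if t == "S")

    precision = true_pos / (true_pos + me) if (true_pos + me) else 0.0
    recall = true_pos / n_resistant if n_resistant else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "f1": f1,
        "vme_rate": vme / n_resistant if n_resistant else 0.0,
        "me_rate": me / n_susceptible if n_susceptible else 0.0,
    }


# Hypothetical toy labels: four resistant and four susceptible genomes.
truth = ["R", "R", "R", "R", "S", "S", "S", "S"]
preds = ["R", "R", "R", "S", "S", "S", "S", "R"]
metrics = amr_error_metrics(truth, preds)
```

The abstract reports average F1 scores, so the study may have averaged across classes or replicates; this sketch computes a single-class F1 for one set of predictions only.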

show abstract

“…Specifically, Inparanoid distinguishes between orthologs and in-paralogs, which were duplicated following a given speciation event [4][5][6]. It is then [...] former identify orthologs and in-paralogs using proxy methods rather than directly inferring homology type from gene and species evolutionary history. In practice, graph-based methods have a similar accuracy as tree-based methods [9,10,19].…”

mentioning

confidence: 99%

SwiftOrtho: a Fast, Memory-Efficient, Multiple Genome Orthology Classifier

Friedberg

2019

Preprint


Introduction: Gene homology type classification is a requisite for many types of genome analyses, including comparative genomics, phylogenetics, and protein function annotation. A large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic datasets, these tools require high memory and CPU usage, typically available only in costly computational clusters. To address this problem, we developed a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data.

Results: In our tests, SwiftOrtho is the only tool that completed orthology analysis of 1,760 bacterial genomes on a computer with only 4GB RAM. Using various standard orthology datasets, we also show that SwiftOrtho has a high accuracy. SwiftOrtho enables the accurate comparative genomic analyses of thousands of genomes using low memory computers.

Availability: https://github.com/Rinoahu/SwiftOrtho

Background: Gene homology type classification consists of identifying paralogs and orthologs across species. Orthologs are genes that evolved from a common ancestral gene following speciation, while paralogs are genes that are homologous due to duplication. Computationally detecting orthologs and paralogs across species is an important problem, as the evolutionary history of genes has implications for our understanding of gene function and evolution.

While the proper inference of homology type involves tracing gene history using phylogenetic trees [1], several proxy methods have been developed over the years. The most common method to infer orthologs by proxy is Reciprocal Best Hit or RBH [2,3]. Briefly, RBH states the following: when two proteins that are encoded [...] they are considered to be orthologs [2,3]. Inparanoid extends the RBH orthology relationship to include both orthologs and in-paralogs [4][5][6]. Specifically, Inparanoid distinguishes between orthologs and in-paralogs, which were duplicated following a given speciation event [4][5][6]. It is then a matter of course to extend orthologous pairs between two species to an ortholog group, where an ortholog group is defined as a set of genes that are hypothesized to have descended from a common ancestor [6]. Several methods have been developed to identify ortholog groups across multiple species. These methods can be classified into two types: tree-based and graph-based. Tree-based methods construct a gene tree from an alignment of homologous sequences in different species and infer orthology relationships by reconciling the gene tree with its corresponding species tree [1,7,8]. Tree-based methods can infer a correct orthology relationship if the correct gene tree and species tree are given [9]. The main limitation of tree-based methods is the accuracy of the given gene tree and species tree. Erroneous trees lead to incorrect ortholog and in-paralog assignments [8][9][10]. Tree-based methods ...
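
The Reciprocal Best Hit idea mentioned in this excerpt can be sketched in a few lines. The excerpt's own statement of RBH is truncated, so the sketch follows the commonly used formulation (two proteins from different genomes are paired when each is the other's highest-scoring match), which may differ in detail from the cited paper. The protein IDs and scores are hypothetical, and this is not SwiftOrtho's implementation.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query_id, subject_id) -> similarity score.
    Ties keep the first pair encountered.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )


# Hypothetical similarity scores between proteins of genome A and genome B.
a_vs_b = {
    ("a1", "b1"): 200.0,
    ("a1", "b2"): 50.0,
    ("a2", "b2"): 120.0,
    ("a3", "b2"): 150.0,
}
b_vs_a = {
    ("b1", "a1"): 198.0,
    ("b2", "a3"): 149.0,
    ("b2", "a2"): 118.0,
}

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a2` has `b2` as its best hit, but `b2` prefers `a3`, so `a2` is left unpaired. Graph-based tools typically add further steps on top of such pairs (for example, in-paralog detection and clustering), which this sketch does not attempt.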
