Introduction: Gene homology type classification is a prerequisite for many types of genome analysis, including comparative genomics, phylogenetics, and protein function annotation. A large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic datasets, these tools require high memory and CPU usage, typically available only in costly computational clusters. To address this problem, we developed a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data.

Results: In our tests, SwiftOrtho was the only tool that completed orthology analysis of 1,760 bacterial genomes on a computer with only 4 GB of RAM. Using various standard orthology datasets, we also show that SwiftOrtho is highly accurate. SwiftOrtho enables accurate comparative genomic analysis of thousands of genomes on low-memory computers.

Availability: https://github.com/Rinoahu/SwiftOrtho

Background

Gene homology type classification consists of identifying paralogs and orthologs across species. Orthologs are genes that evolved from a common ancestral gene following speciation, while paralogs are genes that are homologous due to duplication.
Computationally detecting orthologs and paralogs across species is an important problem, as the evolutionary history of genes has implications for our understanding of gene function and evolution.

While the proper inference of homology type involves tracing gene history using phylogenetic trees [1], several proxy methods have been developed over the years. The most common method to infer orthologs by proxy is Reciprocal Best Hit, or RBH [2,3]. Briefly, RBH states the following: when two proteins that are encoded by genes in two different genomes are each other's best-scoring match, they are considered to be orthologs [2,3].
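The RBH criterion described above can be illustrated with a minimal Python sketch. It is not SwiftOrtho's implementation. It assumes that pairwise hits are already available as (query, target, score) tuples and that gene identifiers follow a hypothetical "genome|gene" format; the function names are illustrative only. Ties are broken by keeping the first hit seen.

```python
def genome_of(gene_id):
    """Extract the genome label from an assumed 'genome|gene' identifier."""
    return gene_id.split("|", 1)[0]


def best_hits(hits):
    """Map (query gene, target genome) to the best-scoring (target gene, score)."""
    best = {}
    for query, target, score in hits:
        key = (query, genome_of(target))
        # Strict '>' keeps the first hit seen when scores are tied.
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return best


def reciprocal_best_hits(hits):
    """Return the set of gene pairs that are each other's best hit across genomes."""
    best = best_hits(hits)
    pairs = set()
    for (query, target_genome), (target, _score) in best.items():
        query_genome = genome_of(query)
        if query_genome == target_genome:
            continue  # within-genome hits cannot be orthologs under RBH
        back = best.get((target, query_genome))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, target))))
    return pairs


# Toy data: a1 and b1 are each other's best hit; a2 and b2 are not reciprocal.
example_hits = [
    ("A|a1", "B|b1", 200.0),
    ("A|a1", "B|b2", 150.0),
    ("B|b1", "A|a1", 198.0),
    ("B|b2", "A|a1", 150.0),
    ("A|a2", "B|b2", 90.0),
    ("B|b2", "A|a2", 80.0),
]
print(reciprocal_best_hits(example_hits))
```

In this toy input, b2's best hit in genome A is a1, but a1's best hit in genome B is b1, so only the a1/b1 pair satisfies the reciprocal condition.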
Inparanoid extends the RBH orthology relationship to include both orthologs and in-paralogs [4][5][6]. Specifically, Inparanoid distinguishes between orthologs and in-paralogs, which were duplicated following a given speciation event [4][5][6]. It is then a matter of course to extend orthologous pairs between two species to an ortholog group, where an ortholog group is defined as a set of genes that are hypothesized to have descended from a common ancestor [6]. Several methods have been developed to identify ortholog groups across multiple species. These methods can be classified into two types: tree-based and graph-based. Tree-based methods construct a gene tree from an alignment of homologous sequences in different species and infer orthology relationships by reconciling the gene tree with its corresponding species tree [1,7,8]. Tree-based methods can infer a correct orthology relationship if the correct gene tree and species tree are given [9]. The main limitation of tree-based methods is their dependence on the accuracy of the given gene tree and species tree: erroneous trees lead to incorrect ortholog and in-paralog assignments [8][9][10]. Tree-based methods ...