The contemporary capacity of genome sequence analysis lags significantly behind rapidly evolving sequencing technologies. Retrieving biologically meaningful information from an ever-increasing amount of genome data would be highly beneficial for functional genomic studies. For example, the duplication, organization, evolution, and function of superfamily genes are arguably important in many aspects of life. However, incomplete annotation of many sequenced genomes often leads to biased conclusions in comparative genomic studies of superfamilies. Here, we present a Perl software package, called Closing Target Trimming (CTT), for automatically identifying most, if not all, members of a gene family in any sequenced genome. Our test data on the F-box gene superfamily showed gene-finding accuracies of 78.2 and 79% in two well-annotated plant genomes, Arabidopsis thaliana and rice, respectively. This annotation performance is clearly higher than that of the best ab initio methods currently available. To further demonstrate the effectiveness of this program, we ran it on 18 plant genomes and five non-plant genomes to compare the expansion of the F-box and BTB superfamilies. The program discovered that, on average, 12.7 and 9.3% of the total F-box and BTB members, respectively, are new loci in plant genomes, whereas it found only a small number of new members in vertebrate genomes. Therefore, different evolutionary and regulatory mechanisms of cullin-RING ubiquitin ligases may be present in the plant and animal kingdoms. Further studies may yield new discoveries in the ubiquitin-26S proteasome system-mediated regulatory pathways of eukaryotic organisms.
With detailed compilation instructions and a simple run procedure, we expect this software to help biologists with little programming experience obtain a comprehensive dataset of a gene superfamily from any sequenced eukaryotic genome.
finding_putative_new_loci.pm, Table 1). However, our previous study used a BLAT search [29] to locate an annotated gene, which may result in inaccurate genomic coordinates [5].
After CTT annotation, we discovered that, on average, 12.7 and 9.3% of the total members in the F-box and BTB families, respectively, are new loci (Tables 3 and 4, S5-S8 Files). Although the proportion of newly discovered F-box genes was slightly higher than that of BTB genes (p < 0.05, Student's t-test), the percentages of new members in these two families are significantly correlated (ρ = 0.69, p-value = 0.001, Spearman's correlation test) (Figure 2B). Therefore, annotation quality varies among sequenced genomes, which further highlights the effectiveness of this package in helping identify most, if not all, superfamily members in a genome for comparative genomic studies.
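The correlation reported above compares, genome by genome, the percentage of family members that are new loci in one family with the same percentage in the other. The following Python snippet is a minimal sketch of how such a Spearman coefficient can be computed from per-genome counts. It is not part of the CTT package, which is written in Perl. The function names and the counts in the demonstration are hypothetical placeholders, not values from Tables 3 and 4.

```python
from math import sqrt


def percent_new(new_loci, total_members):
    """Percentage of family members in one genome that are new loci."""
    return 100.0 * new_loci / total_members


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical (new loci, total members) counts per genome, for illustration only.
    fbox_counts = [(30, 300), (12, 150), (80, 500), (5, 90)]
    btb_counts = [(6, 80), (3, 60), (20, 150), (2, 50)]
    fbox_pct = [percent_new(n, t) for n, t in fbox_counts]
    btb_pct = [percent_new(n, t) for n, t in btb_counts]
    print("Spearman rho:", spearman_rho(fbox_pct, btb_pct))
```

Ranking the percentages before correlating them makes the statistic insensitive to the scale difference between the two families, which suits a comparison of families with very different sizes. The p-value reported in the text would need an additional significance test that this sketch does not implement.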
CTT Annotation of the F-box and the BTB Superfamilies in Non-plant Genomes
Since the CTT algorithm was first developed to study the F-box gene superfamily in plants [5], we questioned whether this newly designed CTT package was also...