Factors influencing gene family size variation among related species in a plant family

Wang, Peipei; Moore, Bethany M.; Panchy, Nicholas; Meng, Fanrui; Lehti-Shiu, Melissa D.; Shiu, Shin-Han

doi:10.1101/270009

Cited by 12 publications

(17 citation statements)

References 91 publications

(97 reference statements)


“…Custom HMM models for major NLR exons (FISNA-NACHT, and PRY-SPRY/B30.2) were generated and utilized during this process (see Additional file 29 for hmm models). The majority of identified FISNA-NACHT exons contained frameshifts or a large insertion, indicating either pseudogenization, acquisition of new introns, problems with the assembly, or a combination of the three [228]. For subsequent phylogenetic analyses, only the 61 clearly intact NLRs were used.…”

Section: Immune System (mentioning)

confidence: 99%

The round goby genome provides insights into mechanisms that may facilitate biological invasions

et al. 2020


Background: The invasive benthic round goby (Neogobius melanostomus) is the most successful temperate invasive fish and has spread in aquatic ecosystems on both sides of the Atlantic. Invasive species constitute powerful in situ experimental systems to study fast adaptation and directional selection on short ecological timescales and present promising case studies to understand factors involved in the impressive ability of some species to colonize novel environments. We seize the unique opportunity presented by the round goby invasion to study genomic substrates potentially involved in colonization success. Results: We report a highly contiguous long-read-based genome and analyze gene families that we hypothesize to relate to the ability of these fish to deal with novel environments. The analyses provide novel insights from the large evolutionary scale to the small species-specific scale. We describe expansions in specific cytochrome P450 enzymes, a remarkably diverse innate immune system, an ancient duplication in red light vision accompanied by red skin fluorescence, evolutionary patterns of epigenetic regulators, and the presence of osmoregulatory genes that may have contributed to the round goby's capacity to invade cold and salty waters. A recurring theme across all analyzed gene families is gene expansions. Conclusions: The expanded innate immune system of round goby may potentially contribute to its ability to colonize novel areas. Since other gene families also feature copy number expansions in the round goby, and since other Gobiidae also feature fascinating environmental adaptations and are excellent colonizers, further long-read genome approaches across the goby family may reveal whether gene copy number expansions are more generally related to the ability to conquer new habitats in Gobiidae or in fish.


“…The most recent duplication points for genes appearing to originate from multiple duplication nodes were defined by the highest-numbered node they belonged to (Figure S7). Pseudogenes in tomato were determined as in Wang et al (2018) where genomic regions with significant similarity to protein-coding genes but with premature stops/frameshifts and/or were truncated were treated as pseudogenes (64). Detailed methods and parsing scripts for different features can be found in: https://github.com/ShiuLab/SM-gene_prediction_Slycopersicum.…”

Section: Evolutionary Features (mentioning)

confidence: 99%

Within and cross species predictions of plant specialized metabolism genes using transfer learning

Moore, Wang, Fan, et al. 2020

Preprint

Self Cite


Plant specialized metabolites mediate interactions between plants and the environment and have significant agronomical/pharmaceutical value. Most genes involved in specialized metabolism (SM) are unknown because of the large number of specialized metabolites and the challenge in differentiating SM genes from general metabolism (GM) genes. We employed transfer learning, a machine learning strategy in which information from one species with substantially more experimentally derived function data (Arabidopsis thaliana) is used to build a model to predict gene functions in another species (Solanum lycopersicum). Using machine learning to integrate heterogeneous gene features, we built models distinguishing tomato SM and GM genes. Although SM/GM genes can be predicted based on tomato data alone (F-measure=0.74, compared with 0.5 for random and 1.0 for perfect predictions), using information from Arabidopsis to filter likely misannotated genes significantly improves prediction (F-measure=0.92). This study demonstrates that SM/GM genes can be better predicted by leveraging cross-species information. Significance: With the increase of sequenced non-model species, a major challenge in plant biology is to ascertain gene function. Model species such as Arabidopsis thaliana have large amounts of experimentally-backed annotations that non-model species lack. We show how to use a model species to better annotate the function of genes in a non-model species using a technique called transfer learning. In particular, we focus on genes involved in specialized metabolism (SM), or metabolism specific to a certain plant lineage, which are not well known because of the sizeable diversity of specialized metabolites (SMs) among plant species. We use Arabidopsis to predict SM genes in tomato, a species with many SMs of interest but with a poorer annotation than Arabidopsis.
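The F-measure values quoted in the abstract above can be read against a small calculation. The sketch below assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name and the example counts are hypothetical and are not taken from the study.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (F1), assumed here to be the metric meant in the abstract."""
    predicted = true_pos + false_pos   # everything the model called positive
    actual = true_pos + false_neg      # everything that really is positive
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: half of the calls are right and half of the positives are found.
print(f_measure(50, 50, 50))  # 0.5
# Hypothetical counts: every call is right and nothing is missed.
print(f_measure(10, 0, 0))    # 1.0
```

With equal-sized classes, a coin-flip classifier would be expected to land near 0.5 under this definition, which is consistent with the random baseline the abstract quotes.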


“…Aside from gene annotations, we defined or obtained additional genome features including pseudogenes, transposable elements (TE), simple sequence repeat (SSR; stretch of DNA, 2 ~ 64bp, repeated >1 time and the repetitions are immediately adjacent to each other), and tandemly duplicated genes. Pseudogenes were defined as genomic regions with significant similarity to protein-coding genes that had premature stops/frameshifts and/or were truncated as described in (27). Transposable element (TE) annotation was based on SGN ITAG2.4 release.…”

Section: Genome and Functional Annotations Definitions Of Genome Fea (mentioning)

confidence: 99%
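The SSR definition in the statement above (a 2 to 64 bp unit, repeated more than once, with the copies immediately adjacent) can be illustrated with a short sketch. This is a plain regular-expression scan written only for illustration; it is not the repeat-finding software or scoring scheme used in the quoted study, and the pattern, function name, and example sequence are hypothetical.

```python
import re

# Assumption: only unambiguous bases are considered; a unit of 2-64 bp must be
# followed by at least one more adjacent copy of itself (backreference \1).
# The lazy quantifier prefers the shortest unit at each starting position.
SSR_PATTERN = re.compile(r"([ACGT]{2,64}?)\1+")


def find_ssrs(sequence):
    """Return (start, unit, copies) for each non-overlapping exact tandem repeat."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


print(find_ssrs("GGATATATCC"))  # [(2, 'AT', 3)]
```

The scan only reports exact copies and can be slow on long sequences because of regex backtracking, so it is suited to small examples rather than whole-genome annotation.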

“…Minscore = 50, Maxperiod = 500 (28). Tandemly duplicated genes were identified using MCScanX-transposed (29), as described previously (27), where paralogs are directly adjacent to each other, or separated by…”

Section: Genome and Functional Annotations Definitions Of Genome Fea (mentioning)

confidence: 99%

Read coverage as an indicator of misassembly in a short-read based genome assembly

Wang, Meng, Moore, et al. 2019

Preprint

Self Cite


the original research plans; F.M. and P.W. conducted the bulk of the computational analysis; B.M. assisted with the machine learning models; P.W. and S.-H.S. wrote the manuscript with contributions of all the authors; S.-H.S. agrees to serve as the author responsible for contact and ensures communication.

Abstract: Availability of genome sequences has led to significant progress in biological sciences and beyond. With few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies. While extremely useful, the short-read coverages across the assemblies are highly uneven, indicative of sequencing and assembly issues. To assess the underlying causes of such uneven read coverage, we used the tomato genome as an example and integrated multiple sequence features to establish machine learning models capable of predicting whether a genomic region has significantly high or low read coverage. Importantly, 0.6% (5.1Mb) and 9.7% (79.6Mb) of tomato genome assembly had significantly higher and lower coverage compared to background, respectively. By evaluating features important for the prediction, we found that GC content and high density of transposon elements are the major contributors to break points in an assembly, leading to gaps filled with Ns and the resulting low read coverages. In contrast, simple sequence repeats and tandemly duplicated genes, especially specialized metabolism genes, tend to be mis-assembled, resulting in high read coverages. We also present evidence of misassembled regions containing tandemly duplicated specialized metabolism genes. The presence of variable coverage regions is expected to significantly impact genome-wide studies, highlighting the need to detect them in short-read based assemblies.
