Recent years have seen a surge in plant genome sequencing projects and in the comparison of multiple related individuals. The high degree of genomic variation observed led to the realisation that a single reference genome does not represent the diversity within a species, and prompted the expansion of the pan-genome concept. Pan-genomes represent the genomic diversity of a species and include core genes, found in all individuals, as well as variable genes, which are absent in some individuals. Variable gene annotations often show similarities across plant species, with genes associated with biotic and abiotic stress commonly enriched within variable gene groups. Here we review the growth of pan-genomics in plants, explore the origins of gene presence/absence variation and show how pan-genomes can support plant breeding and evolution studies.
Pan-genomes in plants: beginnings and current status

The concept of pan-genomes was first developed in bacteria in 2005 1, where the sequencing of several isolates of Streptococcus agalactiae revealed a core genome represented by 80% of S. agalactiae genes, with the other 20% being absent in at least one isolate 1. It took almost 10 years after this initial bacterial pan-genome work for plant pan-genomes to be constructed. This was partially due to the expense of data generation, but also to the expectation that there would be very little gene presence/absence variation (PAV) in higher organisms, which do not exchange genetic material as freely as bacteria 2. The first publication to apply the term pan-genome in plants appeared in 2007, where it described short variable regions in the rice and maize genomes 3. The extent of gene presence/absence variation was not understood at that time, owing to the lack of accurate whole-genome assemblies for multiple individuals of the same species. As DNA sequencing costs declined, it became feasible to undertake whole-genome comparisons within species, and three general approaches for pan-genome assembly were developed 4,5 (Figure 1). The first was whole-genome assembly and comparison, where the genomes of multiple individuals are assembled independently and then compared. This was later complemented by the iterative assembly and PAV calling approach, where genomic reads from multiple individuals are aligned to a reference, and the non-aligning reads are assembled and added to the growing pan-genome reference. Subsequent remapping of all reads to the pan-genome permits PAV calling across the population. More recently, there have been rapid developments in graph-based pan-genome assembly, where a graph representing genomic diversity and conservation is constructed 6.
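The distinction between core and variable genes can be illustrated with a small sketch. The example below is not taken from any of the cited studies: the function name `classify_genes`, the gene names and the presence/absence matrix are all hypothetical, and the matrix is assumed to be the output of a PAV calling step such as the remapping stage described above. A gene is treated as core if it is called present in every individual, and as variable otherwise.

```python
from typing import Dict, List, Tuple


def classify_genes(pav: Dict[str, List[int]]) -> Tuple[List[str], List[str]]:
    """Split genes into core (present in every individual) and variable.

    `pav` maps a gene name to its presence (1) / absence (0) calls,
    one entry per individual. This layout is an assumption made for
    illustration, not a standard file format.
    """
    core: List[str] = []
    variable: List[str] = []
    for gene, calls in pav.items():
        if all(calls):
            core.append(gene)      # present in all individuals
        else:
            variable.append(gene)  # absent in at least one individual
    return core, variable


# Hypothetical PAV matrix for four individuals (columns) and four genes (rows).
pav_matrix = {
    "geneA": [1, 1, 1, 1],
    "geneB": [1, 0, 1, 1],
    "geneC": [1, 1, 1, 1],
    "geneD": [0, 0, 1, 0],
}

core_genes, variable_genes = classify_genes(pav_matrix)

# Fraction of the pan-genome that is core in this toy data set.
core_fraction = len(core_genes) / len(pav_matrix)
```

In real data sets the presence calls are usually derived from read coverage thresholds, so the boundary between core and variable genes depends on the calling parameters and on how many individuals are sampled.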