In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to the human reference genome since its initial release. The new T2T-CHM13 reference includes gapless assemblies for all 22 autosomes plus chromosome X, corrects numerous errors, and introduces nearly 200 million bp of novel sequence containing 2,226 paralogous gene copies, 115 of which are predicted to be protein coding. The newly completed regions include all centromeric satellite arrays and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies for the first time.
International initiatives aimed at generating genomic resources, and particularly reference genomes, have flourished in recent years. Some focus on specific taxa, such as the Vertebrate Genomes Project, Bird Genome 10K Project, Bat1K Project, Global Invertebrate Genomics Alliance, 10 000 Plant Genomes Project, and 1000 Fungal Genomes project. Others focus on geographic regions, such as the California Conservation Genomics Project, Darwin Tree of Life for Britain and Ireland, Catalan Initiative for the Earth BioGenome Project in the Catalan territories, Endemixit in Italy, Norwegian Earth Biogenome Project, and SciLifeLab in Sweden, on applications such as the LOEWE Translational Biodiversity Genomics in Germany, or on ecological systems such as the Aquatic Symbiosis Genomics project. Collectively part of the Earth BioGenome Project (EBP), in Europe these initiatives are organized under the umbrella of the European Reference Genome Atlas (ERGA).
A genome atlas of European biodiversity
ERGA is a pan-European scientific response to the current threats to biodiversity. Approximately one fifth of the ~200 000 eukaryotic species present in Europe can be inferred to be at risk of extinction according to the International Union for Conservation of Nature (IUCN) Red List classification (this estimate only considers the assessed species; https://www.iucn.org/regions/europe/our-work/biodiversity-conservation/european-red-list-threatened-species). ERGA aims to generate reference genomes of European eukaryotic species across the tree of life, including threatened, endemic, and keystone species, as well as pests and species important to agriculture, fisheries, and ecosystem function and stability. ERGA builds upon current genomic consortia in EU member states, EU Associated Countries, representatives of other countries within the European bioregion, and international collaborators.
These reference genomes will address fundamental and applied questions in conservation, biology, and health. ERGA seeks to alert the EU about the potential of conservation genomics, and particularly the role of reference genomes, in biodiversity assessment, conservation strategies, and restoration efforts.
Genome assembly is a foundational practice of quantitative biological research with increasing utility. By representing the genomic sequence of a sample of interest, genome assemblies enable researchers to annotate important features, quantify functional data and discover/genotype genetic variants in a population [1-6]. Modern draft eukaryotic genome assembly graphs are typically built from a subset of four whole-genome shotgun (WGS) sequencing data types: Illumina short reads [7,8], Oxford Nanopore Technologies (ONT) long reads [9,10], PacBio continuous long reads (CLRs) and PacBio high-fidelity (HiFi) long reads [9,11], all of which have been extensively described [7-9,11]. However, we note that even the high-accuracy technologies produce sequencing data with some noise caused by platform-specific technical biases that require careful validation and polishing [1,11-14]. Current genome assembly software attempts to reconstruct an individual or mosaic haplotype sequence from a subset of the above WGS data types. Some assemblers do not attempt to correct sequencing errors [15], while others attempt to remove errors at various stages of the assembly process [16-20]. Regardless, technology-specific sequencing errors usually lead to distinct assembly errors [14,21]. Additionally, suboptimal assembly of specific loci often causes small and large errors in draft assemblies [22,23]. Here, we define 'polishing' as the process of removing these errors from draft genome assemblies. Most polishing tools use an approach that is similar to sequence-based genetic variant discovery. Specifically, reads from the same individual are aligned to a draft assembly, and putative 'variant'-like sequence edits are identified [23,24]. For diploid genomes, heterozygous 'alternate' alleles are interpreted as genuine heterozygous variants, while homozygous alternate alleles are interpreted as assembly errors to be corrected.
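The genotype rule described above can be illustrated with a minimal Python sketch. It is not the method of any particular polishing tool: the `Call` record, its fields, and the helper names are hypothetical, calls are assumed to be non-overlapping and already produced by some upstream read-alignment and variant-calling step, and genotypes use the common convention that allele 0 is the draft sequence and allele 1 is the read-supported alternative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Call:
    """Hypothetical variant-like edit called from reads aligned to a draft assembly."""
    contig: str
    pos: int                   # 0-based start position on the draft contig
    ref: str                   # sequence currently present in the draft
    alt: str                   # sequence supported by the aligned reads
    genotype: Tuple[int, ...]  # e.g. (0, 1) heterozygous, (1, 1) homozygous alternate


def is_homozygous_alt(call: Call) -> bool:
    """True when every called allele is the alternate allele."""
    return len(call.genotype) > 0 and all(allele == 1 for allele in call.genotype)


def select_polishing_edits(calls: List[Call]) -> List[Call]:
    """Keep homozygous-alternate calls (treated as assembly errors);
    heterozygous calls are left alone as genuine variants."""
    return [c for c in calls if is_homozygous_alt(c)]


def apply_edits(contig_seq: str, edits: List[Call]) -> str:
    """Apply non-overlapping edits right to left so earlier coordinates stay valid."""
    seq = contig_seq
    for e in sorted(edits, key=lambda c: c.pos, reverse=True):
        if seq[e.pos:e.pos + len(e.ref)] != e.ref:
            raise ValueError(f"draft does not match ref allele at {e.contig}:{e.pos}")
        seq = seq[:e.pos] + e.alt + seq[e.pos + len(e.ref):]
    return seq


draft = "ACGTTTGA"
calls = [
    Call("ctg1", 2, "G", "C", (1, 1)),   # homozygous alt: corrected
    Call("ctg1", 5, "T", "TT", (0, 1)),  # heterozygous: kept as a real variant
    Call("ctg1", 6, "G", "", (1, 1)),    # homozygous alt deletion: corrected
]
polished = apply_edits(draft, select_polishing_edits(calls))
print(polished)
```

In this toy input only the two homozygous-alternate calls change the draft; the heterozygous insertion is ignored by the polishing step.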
Some polishing tools, such as Quiver/Arrow, Nanopolish, Medaka, DeepVariant and PEPPER, leverage specialized models and previous knowledge to correct errors caused by technology-specific bias [25-29]. Others, such as Racon [30], use generic methods to correct assembly errors with a subset of sequencing technologies [30-32]. These generic tools can utilize multiple data types to synergistically overcome technology-specific assembly errors. The telomere-to-telomere (T2T) consortium recently convened an international workshop to assemble the first-ever complete sequence of a human genome. Because heterozygosity can complicate assembly algorithms, the consortium chose to assemble the highly homozygous genome of a complete hydatidiform mole
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure including long palindromes, tandem repeats, and segmental duplications. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029 base pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, revealing the complete ampliconic structures of TSPY, DAZ, and RBMY; 42 additional protein-coding genes, mostly from the TSPY gene family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a prior assembly of the CHM13 genome and mapped available population variation, clinical variants, and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
Background: Placental mammals display a huge range of life history traits, including size, longevity, metabolic rate and germ line generation time. Although a number of general trends have been proposed between these traits, there are exceptions that warrant further investigation. Species such as naked mole rat, human and certain bat species all exhibit extreme longevity with respect to body size. It has long been established that telomeres and telomere maintenance have a clear role in ageing but it has not yet been established whether there is evidence for adaptation in telomere maintenance proteins that could account for increased longevity in these species. Results: Here we carry out a molecular investigation of selective pressure variation, specifically focusing on telomere associated genes across placental mammals. In general we observe a large number of instances of positive selection acting on telomere genes. Although these signatures of selection overall are not significantly correlated with either longevity or body size, we do identify positive selection in the microbat species Myotis lucifugus in functionally important regions of the telomere maintenance genes DKC1 and TERT, and in naked mole rat in the DNA repair gene BRCA1. Conclusion: These results demonstrate the multifarious selective pressures acting across the mammal phylogeny driving lineage-specific adaptations of telomere associated genes. Our results show that regardless of the longevity of a species, these proteins have evolved under positive selection, thereby removing increased longevity as the single selective force driving this rapid rate of evolution. However, evidence of molecular adaptations specific to naked mole rat and Myotis lucifugus highlights functionally significant regions in genes that may alter the way in which telomeres are regulated and maintained in these longer-lived species.
The field of genomics has benefited greatly from its “openness” approach to data sharing. However, with the increasing volume of sequence information being created and stored and the growing number of international genomics efforts, the equity of openness is under question. The United Nations Convention on Biological Diversity aims to develop and adopt a standard policy on access and benefit-sharing for sequence information across signatory parties. This standardization will have profound implications on genomics research, requiring a new definition of open data sharing. The redefinition of openness is not unwarranted, as its limitations have unintentionally introduced barriers of engagement to some, including Indigenous Peoples. This commentary provides an insight into the key challenges of openness faced by the researchers who aspire to protect and conserve global biodiversity, including Indigenous flora and fauna, and presents immediate, practical solutions that, if implemented, will equip the genomics community with both the diversity and inclusivity required to respectfully protect global biodiversity.
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
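To put the reported QV of 73.9 in perspective, the sketch below converts a quality value into a per-base error rate. It assumes QV is the conventional Phred-scaled consensus quality, QV = -10 · log10(per-base error probability); the function names are illustrative, and the genome length in the last line is the 3.055 billion bp figure quoted for T2T-CHM13, used here only as an example input.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled QV."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled QV implied by a per-base error probability in (0, 1]."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def expected_errors(qv: float, genome_length_bp: int) -> float:
    """Expected number of erroneous bases, assuming a uniform per-base error rate."""
    return genome_length_bp * qv_to_error_rate(qv)


rate = qv_to_error_rate(73.9)
print(f"error rate: {rate:.3e}")              # roughly 4e-8 errors per base
print(f"one error per {1 / rate:,.0f} bases")
print(f"expected errors in 3.055 Gbp: {expected_errors(73.9, 3_055_000_000):.0f}")
```

Under these assumptions, QV 73.9 corresponds to roughly one error per 25 million bases. The uniform-error assumption is a simplification, since the abstract itself notes that errors concentrate in repeats.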