Microarrays can be a cost-effective alternative to high-throughput sequencing for discovering novel single-nucleotide polymorphisms (SNPs). Illumina’s iScan platform dominates the market, but their commercial microarray products are designed for model organisms. Further, the platform outputs data in a proprietary format. This cannot be easily converted to human-readable genotypes or be merged with pre-existing data. To address this, we present and validate a novel pipeline to facilitate data analysis from cross-species application of Illumina microarrays. This facilitates the generation of a compatible VCF from iScan data and the merging of this with a second VCF comprising genotypes derived from other samples and sources. Our pipeline includes a custom script, iScanVCFMerge (presented as a Python package), which we validate using iScan data from three great ape genera. We conclude that cross-species application of microarrays can be a rapid, cost-effective approach for SNP discovery in non-model organisms. Our pipeline surmounts the common challenges of integrating iScan genotypes with pre-existing data.
Background Orang-utans comprise three critically endangered species endemic to the islands of Borneo and Sumatra. Though whole-genome sequencing has recently accelerated our understanding of their evolutionary history, the costs of implementing routine genome screening and diagnostics remain prohibitive. Capitalizing on a tri-fold locus discovery approach, combining data from published whole-genome sequences, novel whole-exome sequencing, and microarray-derived genotype data, we aimed to develop a highly informative gene-focused panel of targets that can be used to address a broad range of research questions. Results We identified and present genomic co-ordinates for 175,186 SNPs and 2315 Y-chromosomal targets, plus 185 genes either known or presumed to be pathogenic in cardiovascular (N = 109) or respiratory (N = 43) diseases in humans – the primary and secondary causes of captive orang-utan mortality – or a majority of other human diseases (N = 33). As proof of concept, we designed and synthesized ‘SeqCap’ hybrid capture probes for these targets, demonstrating cost-effective target enrichment and reduced-representation sequencing. Conclusions Our targets are of broad utility in studies of orang-utan ancestry, admixture and disease susceptibility and aetiology, and thus are of value in addressing questions key to the survival of these species. To facilitate comparative analyses, these targets could now be standardized for future orang-utan population genomic studies. The targets are broadly compatible with commercial target enrichment platforms and can be utilized as published here to synthesize applicable probes.
The Sumatran orang-utan (Pongo abelii) reference genome was first published in 2011, in conjunction with ten re-sequenced genomes from unrelated wild-caught individuals. Together, these published data have been utilized in almost all great ape genomic studies, plus in much broader comparative genomic research. Here, we report that the original sequencing Consortium inadvertently switched nine of the ten samples and/or resulting re-sequenced genomes, erroneously attributing eight of these to the wrong source individuals. Among them is a genome from the recently identified Tapanuli (P. tapanuliensis) species: thus, this genome was sequenced and published a full six years prior to the species’ description. Sex was wrongly assigned to five known individuals; the numbers in one sample identifier were swapped; and the identifier for another sample most closely resembles that of a sample from another individual entirely. These errors have been reproduced in countless subsequent manuscripts, with noted implications for studies reliant on data from known individuals.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.