High-throughput genotyping and sequencing technologies can generate dense sets of genetic markers for large numbers of individuals. For most species, these data will contain many markers in linkage disequilibrium (LD). To utilize such data for population structure inference, we investigate the use of haplotypes constructed by combining the alleles at single-nucleotide polymorphisms (SNPs). We introduce a statistic derived from information theory, the gain of informativeness for assignment (GIA), which quantifies the additional information for assigning individuals to populations gained by using haplotype data rather than the individual loci separately. Using a two-locus, two-allele model, we demonstrate that combining markers in linkage equilibrium into haplotypes always leads to nonpositive GIA, suggesting that combining the two markers is not advantageous for ancestry inference. For loci in LD, however, GIA is often positive, suggesting that assignment can be improved by combining markers into haplotypes. Using GIA as a criterion for combining markers into haplotypes, we demonstrate for simulated data a significant improvement in the assignment of individuals to candidate populations. Across the many cases that we investigate, incorrect assignment was reduced by between 26% and 97% using haplotype data. For empirical data from French and German individuals, the number of incorrectly assigned individuals can, for example, be decreased by 73% using haplotypes. Our results can be useful for challenging population structure and assignment problems, in particular for studies where large-scale population-genomic data are available.
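To make the behavior of the statistic concrete, the following minimal sketch computes a GIA-like quantity for the two-locus, two-allele setting. It rests on assumptions that are not stated in the text above: informativeness for assignment is taken to be the mutual information between allele and population with equally weighted populations, and GIA is taken to be the informativeness of the haplotype locus minus the sum of the informativeness of the two individual loci. The function names and the example frequencies are hypothetical and chosen for illustration only.

```python
import numpy as np

def _plogp(p):
    """Elementwise p * log(p), using the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = p > 0
    out[mask] = p[mask] * np.log(p[mask])
    return out

def informativeness(freqs):
    """Assumed informativeness for assignment of one locus.

    freqs has shape (K populations, N alleles); each row sums to 1.
    Populations are weighted equally (an assumption of this sketch).
    """
    freqs = np.asarray(freqs, dtype=float)
    pooled = freqs.mean(axis=0)  # average allele frequencies across populations
    return float(-_plogp(pooled).sum() + _plogp(freqs).mean(axis=0).sum())

def haplotype_freqs(p, q, d):
    """2x2 haplotype frequencies from allele frequencies p, q and LD coefficient d."""
    return np.array([[p * q + d, p * (1 - q) - d],
                     [(1 - p) * q - d, (1 - p) * (1 - q) + d]])

def gia(hap):
    """Assumed GIA: informativeness of the haplotype minus that of both loci.

    hap has shape (K, 2, 2), with hap[i, a, b] the frequency in population i
    of the haplotype carrying allele a at locus 1 and allele b at locus 2.
    """
    hap = np.asarray(hap, dtype=float)
    k = hap.shape[0]
    i_hap = informativeness(hap.reshape(k, -1))  # haplotype as one 4-allele locus
    i_locus1 = informativeness(hap.sum(axis=2))  # marginal frequencies at locus 1
    i_locus2 = informativeness(hap.sum(axis=1))  # marginal frequencies at locus 2
    return i_hap - i_locus1 - i_locus2

# Linkage equilibrium within each population (d = 0), differing allele frequencies.
gia_le = gia([haplotype_freqs(0.2, 0.3, 0.0), haplotype_freqs(0.7, 0.6, 0.0)])

# Identical allele frequencies in both populations, but LD of opposite sign.
gia_ld = gia([haplotype_freqs(0.5, 0.5, 0.25), haplotype_freqs(0.5, 0.5, -0.25)])

print(gia_le, gia_ld)
```

Under these assumptions, the first example gives a negative value, in line with the nonpositive GIA stated for markers in linkage equilibrium. In the second example neither locus is informative on its own, yet the haplotype separates the two populations, so the value is positive.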
Population structure and the assignment of individuals to populations have attracted considerable attention in population genetics, conservation biology, and ecology (Pritchard et al. 2000; Beaumont 2004; Manel et al. 2005; Platt et al. 2010). Since the introduction of Wright's F_ST (Wright 1921, 1943), numerous studies of population structure have been conducted for a multitude of species, using a variety of genetic or phenotypic markers. The recent development of high-throughput genotyping and sequencing technologies has resulted in a substantial increase in studies of population structure that are based on a large number of markers (e.g., Jakobsson et al. 2008; Platt et al. 2010; Vonholdt et al. 2010). At the same time, powerful clustering methods have been developed to infer population structure on the basis of multilocus genetic data (e.g., Pritchard et al. 2000; Dawson and Belkhir 2001; Corander et al. 2003; François et al. 2006; Huelsenbeck and Andolfatto 2007; Alexander et al. 2009).
For most species, individuals rarely reproduce at random, and this can create genetically differentiated subgroups within a population or species. Geographic barriers such as mountains, rivers, and oceans can further hinder random mating, thereby causing populations to be structured (Hale et al. 2001; Rosenberg et al. 2005). In humans, cultural differences, such as language or religious beliefs, may play an additional role in shapin...