Minimizing the Average Distance to a Closest Leaf in a Phylogenetic Tree

Matsen, F. A.; Gallagher, Aaron; McCoy, Connor O.

doi:10.1093/sysbio/syt044

Cited by 12 publications

(19 citation statements)

References 21 publications

“…An amplicon-specific profile HMM was created from an alignment of representative sequences from multiple subtypes. For each subject and amplicon, 20 reference sequences were selected by placing 454 reads on a tree of candidate reference sequences [34] and minimizing the average distance to the closest leaf [35]. These reference sequences, representatives from subtypes common to the region, and 454 reads were aligned to the HMM using hmmalign [36] and non-consensus columns removed.…”

Section: Methods

mentioning

confidence: 99%

HIV-1 Superinfection Occurs Less Frequently Than Initial Infection in a Cohort of High-Risk Kenyan Women

et al. 2013

Self Cite

HIV superinfection (reinfection) has been reported in several settings, but no study has been designed and powered to rigorously compare its incidence to that of initial infection. Determining whether HIV infection reduces the risk of superinfection is critical to understanding whether an immune response to natural HIV infection is protective. This study compares the incidence of initial infection and superinfection in a prospective seroincident cohort of high-risk women in Mombasa, Kenya. A next-generation sequencing-based pipeline was developed to screen 129 women for superinfection. Longitudinal plasma samples at <6 months, >2 years and one intervening time after initial HIV infection were analyzed. Amplicons in three genome regions were sequenced and a median of 901 sequences obtained per gene per timepoint. Phylogenetic evidence of polyphyly, confirmed by pairwise distance analysis, defined superinfection. Superinfection timing was determined by sequencing virus from intervening timepoints. These data were combined with published data from 17 additional women in the same cohort, totaling 146 women screened. Twenty-one cases of superinfection were identified for an estimated incidence rate of 2.61 per 100 person-years (pys). The incidence rate of initial infection among 1910 women in the same cohort was 5.75 per 100pys. Andersen-Gill proportional hazards models were used to compare incidences, adjusting for covariates known to influence HIV susceptibility in this cohort. Superinfection incidence was significantly lower than initial infection incidence, with a hazard ratio of 0.47 (CI 0.29–0.75, p = 0.0019). This lower incidence of superinfection was only observed >6 months after initial infection. This is the first adequately powered study to report that HIV infection reduces the risk of reinfection, raising the possibility that immune responses to natural infection are partially protective. 
The observation that superinfection risk changes with time implies a window of protection that coincides with the maturation of HIV-specific immunity.

“…Let H be the set of n haplotypes, and let X be the selected k-element subset of H. The objective is then to find X such that the branch-length distance from a randomly chosen haplotype in H to its closest neighboring haplotype in X is minimized over all possible k-element subsets of H (Matsen et al 2013). Note that because the haplotypes in X are also in H, each of these haplotypes is its own closest neighbor, and we can equivalently consider either H or […].…”

Section: Methods

mentioning

confidence: 99%
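
As a rough illustration of the objective quoted above, the following sketch computes ADCL for a candidate subset X and finds a minimizing k-element subset by exhaustive search. It assumes the branch-length distances are already available as a pairwise matrix; the matrix `dist`, the function names, and the four-haplotype example are hypothetical and do not come from Matsen et al. (2013) or the citing paper. Exhaustive search is only practical for very small n.

```python
from itertools import combinations

def adcl(dist, selected):
    """Average, over all haplotypes, of the distance to the closest selected one."""
    n = len(dist)
    return sum(min(dist[i][j] for j in selected) for i in range(n)) / n

def best_subset_exhaustive(dist, k):
    """Score every k-element subset and return one with the smallest ADCL."""
    n = len(dist)
    return min(combinations(range(n), k), key=lambda s: adcl(dist, s))

# Hypothetical pairwise branch-length distances for four haplotypes
# (assumed values chosen for illustration only).
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 4.0, 5.0],
    [4.0, 4.0, 0.0, 2.0],
    [5.0, 5.0, 2.0, 0.0],
]

subset = best_subset_exhaustive(dist, 2)
print(subset, adcl(dist, subset))
```

Because each selected haplotype is at distance zero from itself, the members of X add nothing to the sum, which matches the remark in the quoted passage that each selected haplotype is its own closest neighbor.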

“…In a detailed study of ADCL, Matsen et al (2013) demonstrated that unlike when choosing the subset that maximizes PD, the greedy algorithm need not give rise to the globally optimal ADCL solution. It is therefore necessary to produce alternative algorithms that seek to minimize ADCL.…”

Section: Methods

mentioning

confidence: 99%

“…Matsen et al (2013) described two algorithms that, for a given set of haplotypes, seek to produce the subset of size k that minimizes ADCL. The first approach leverages similarities between the problem of minimizing ADCL and the technique known as k-medoids clustering (Kaufman and Rousseeuw 1987).…”

Section: Methods

mentioning

confidence: 99%
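
To make the k-medoids connection concrete, here is a minimal sketch of a generic swap-based local search in the style of k-medoids, applied to the ADCL objective. It is not the algorithm published by Matsen et al. (2013); the distance matrix, the function names, and the starting subset are assumptions made for illustration. Like other local searches, it can stop at a local optimum.

```python
def adcl(dist, selected):
    """Average, over all haplotypes, of the distance to the closest selected one."""
    n = len(dist)
    return sum(min(dist[i][j] for j in selected) for i in range(n)) / n

def k_medoids_adcl(dist, k, initial=None):
    """Swap one selected haplotype for an unselected one while ADCL keeps improving."""
    n = len(dist)
    medoids = list(initial) if initial is not None else list(range(k))
    best = adcl(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # Replace the medoid at position `pos` with the candidate.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                score = adcl(dist, trial)
                if score < best:  # accept only strict improvements
                    medoids, best = trial, score
                    improved = True
    return sorted(medoids), best

# Hypothetical pairwise branch-length distances for four haplotypes
# (assumed values chosen for illustration only).
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 4.0, 5.0],
    [4.0, 4.0, 0.0, 2.0],
    [5.0, 5.0, 2.0, 0.0],
]

medoids, score = k_medoids_adcl(dist, 2, initial=[0, 1])
print(medoids, score)
```

Accepting only strict improvements guarantees termination, since ADCL decreases with every accepted swap and there are finitely many subsets.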

“…PD can be viewed as emphasizing diversity in the internal reference panel rather than representativeness. To determine whether an alternative focused on identifying the most representative subsample for use as the internal reference panel is preferable, we adapted another method borrowed from phylogenetic studies (Matsen et al 2013): minimizing the average distance to the closest leaf (ADCL), an approach that identifies reference haplotypes based on their genetic proximity to the rest of the sample haplotypes. We compare the imputation accuracy of the maximum-PD, minimum-ADCL, and random reference panels on both simulated data and data from the 1000 Genomes Project, and find that the minimum-ADCL panel consistently provides higher imputation accuracy, irrespective of changes to parameters such as reference panel size, marker density, and sequence length.…”

mentioning

confidence: 99%

Choosing Subsamples for Sequencing Studies by Minimizing the Average Distance to the Closest Leaf

et al. 2015

Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels because they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal reference panel—minimizing the average distance to the closest leaf (ADCL)—and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.

Comparison of genotype imputation strategies using a combined reference panel for chicken population

Yuan, Huang, et al. 2019

Animal

Whole-genome sequence (WGS) data are considered optimal for genome-wide association studies and genomic predictions. However, sequencing thousands of individuals of interest is expensive. Imputation from single nucleotide polymorphism panels to WGS data is an attractive approach to obtain highly reliable WGS data at low cost. Here, we conducted a genotype imputation study with a combined reference panel in a yellow-feather dwarf broiler population. The combined reference panel was assembled from sequences of 24 key individuals of a yellow-feather dwarf broiler population (internal reference panel) and WGS data from 311 chickens in public databases (external reference panel). Three scenarios were investigated to determine how different factors affect the accuracy of imputation from 600 K array data to WGS data: genotype imputation with internal, external and combined reference panels; the number of internal reference individuals in the combined reference panel; and different reference sizes and selection strategies of an external reference panel. Results showed that imputation accuracy from 600 K to WGS data was 0.834±0.012, 0.920±0.007 and 0.982±0.003 for the internal, external and combined reference panels, respectively. Increasing the reference size from 50 to 250 improved the accuracy of genotype imputation from 0.848 to 0.974 for the combined reference panel and from 0.647 to 0.917 for the external reference panel. The selection strategies for the external reference panel had no impact on the accuracy of imputation using the combined reference panel. However, if only an external reference panel with reference size >50 was used, the selection strategy of minimizing the average distance to the closest leaf had the greatest imputation accuracy compared with other methods. Generally, using a combined reference panel provided greater imputation accuracy, especially for low-frequency variants.
In conclusion, the optimal imputation strategy with a combined reference panel should comprehensively consider genetic diversity of the study population, availability and properties of external reference panels, sequencing and computing costs, and frequency of imputed variants. This work sheds light on how to design and execute genotype imputation with a combined external reference panel in a livestock population.
