Théo Trouillon scite author profile

Fast and Efficient Estimation of Individual Ancestry Coefficients

Inference of individual ancestry coefficients, which is important for population genetic and association studies, is commonly performed using computer-intensive likelihood algorithms. With the availability of large population genomic data sets, fast versions of likelihood algorithms have attracted considerable attention. Reducing the computational burden of estimation algorithms remains, however, a major challenge. Here, we present a fast and efficient method for estimating individual ancestry coefficients based on sparse nonnegative matrix factorization algorithms. We implemented our method in the computer program sNMF and applied it to human and plant data sets. The performance of sNMF was then compared with that of the likelihood algorithm implemented in the computer program ADMIXTURE. Without loss of accuracy, sNMF computed estimates of ancestry coefficients with runtimes 10-30 times shorter than those of ADMIXTURE.

Inference of population structure from multilocus genotype data is commonly performed using likelihood methods implemented in the computer programs STRUCTURE, FRAPPE, and ADMIXTURE (Pritchard et al. 2000a; Tang et al. 2005; Alexander et al. 2009). These programs compute probabilistic quantities called ancestry coefficients that represent the proportions of an individual genome that originate from multiple ancestral gene pools. Estimation of ancestry proportions is important in many respects, for example in delineating genetic clusters, drawing inference about the history of a species, screening genomes for signatures of natural selection, and performing statistical corrections in genome-wide association studies (Pritchard et al. 2000b; Marchini et al. 2004; Price et al. 2006; Frichot et al. 2013).

Individual ancestry coefficients can be estimated using either supervised or unsupervised statistical methods. Supervised estimation methods use predefined source populations as ancestral populations. Classical supervised estimation approaches were based on least-squares regression of allele frequencies in hybrid and source populations (Roberts and Hiorns 1965; Cavalli-Sforza and Bodmer 1971). Unsupervised approaches attempt to infer ancestral gene pools from the data, using likelihood methods. An undesired feature of likelihood methods is that they can be computer intensive, with typical runs lasting several hours or more. With the use of dense genomic data and increased sample sizes, reducing the time lag necessary to perform estimation is a major challenge of population genetic data analysis.

A fast approach to the estimation of ancestry coefficients is to use principal component analysis (PCA). PCA is an exploratory method that describes high-dimensional data, using a small number of dimensions, and makes no assumptions about sampled and ancestral populations. Using PCA can lead to results surprisingly close to likelihood methods, and connections between methods have been intensively investigated during recent years (Engelhardt and Stephens 2010; Frichot et al. 2012). But a drawback of PCA is that interpr...
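The factorization idea mentioned in the abstract can be illustrated with a toy example. The sketch below is not the sNMF program: it swaps in plain nonnegative matrix factorization with multiplicative updates (no sparsity term), applied to simulated genotype counts. The function name `nmf_ancestry`, its parameters, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def nmf_ancestry(X, K, n_iter=500, seed=0, eps=1e-9):
    """Hypothetical helper: factor a nonnegative matrix X (individuals x loci)
    as X ~ Q @ G with plain multiplicative-update NMF, then rescale each row
    of Q to sum to 1. This is NOT the sNMF algorithm, only an illustration."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Q = rng.random((n, K)) + eps  # individual-by-cluster weights
    G = rng.random((K, m)) + eps  # cluster-by-locus loadings
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of Q and G nonnegative.
        G *= (Q.T @ X) / (Q.T @ Q @ G + eps)
        Q *= (X @ G.T) / (Q @ G @ G.T + eps)
    # Normalize rows so they can be read as proportions. G is returned
    # unscaled, so Q_norm @ G no longer approximates X.
    Q_norm = Q / Q.sum(axis=1, keepdims=True)
    return Q_norm, G

# Simulated data (an assumption): 30 individuals, 200 loci, 2 ancestral pools.
rng = np.random.default_rng(1)
n, m, K = 30, 200, 2
true_Q = rng.dirichlet(np.ones(K), size=n)         # true admixture proportions
freqs = rng.random((K, m))                         # ancestral allele frequencies
X = rng.binomial(2, true_Q @ freqs).astype(float)  # genotypes coded 0/1/2

Q_hat, G_hat = nmf_ancestry(X, K)
print(Q_hat[:3])  # each row: estimated proportions for one individual
```

Each row of `Q_hat` is nonnegative and sums to 1, which is what lets it be read as a set of ancestry proportions. How closely such a plain factorization tracks likelihood-based estimates is not something this sketch demonstrates.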


On Inductive Abilities of Latent Factor Models for Relational Learning

Trouillon, Gaussier, Dance et al. 2019, JAIR


Latent factor models are increasingly popular for modeling multi-relational knowledge graphs. By their vectorial nature, it is not only hard to interpret why this class of models works so well, but also to understand where they fail and how they might be improved. We conduct an experimental survey of state-of-the-art models, not towards a purely comparative end, but as a means to get insight about their inductive abilities. To assess the strengths and weaknesses of each model, we create simple tasks that exhibit first, atomic properties of binary relations, and then, common inter-relational inference through synthetic genealogies. Based on these experimental results, we propose new research directions to improve on existing models.


