2020
DOI: 10.1101/2020.08.12.248278
Preprint

Visualizing Population Structure with Variational Autoencoders

Abstract: Dimensionality reduction is a common tool for visualization and inference of population structure from genotypes, but popular methods either return too many dimensions for easy plotting (PCA) or fail to preserve global geometry (t-SNE and UMAP). Here we explore the utility of variational autoencoders (VAEs) – generative machine learning models in which a pair of neural networks seek to first compress and then recreate the input data – for visualizing population genetic variation. VAEs incorporate non-linear re…
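
The abstract's description of a VAE (an encoder network that compresses the input and a decoder network that reconstructs it) can be made concrete with a minimal sketch. The PyTorch implementation below is illustrative only; the layer sizes, the two-dimensional latent space, and the Bernoulli reconstruction likelihood are assumptions for demonstration, not the architecture of the paper's published tool.

```python
# Minimal VAE sketch for genotype data (illustrative, not the paper's code).
# Genotypes are assumed to be scaled to [0, 1] (e.g. 0, 0.5, 1 for diploids).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenotypeVAE(nn.Module):
    def __init__(self, n_snps, latent_dim=2, hidden=128):
        super().__init__()
        # Encoder: genotypes -> mean and log-variance of the latent code
        self.enc = nn.Linear(n_snps, hidden)
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        # Decoder: latent code -> reconstructed genotype probabilities
        self.dec1 = nn.Linear(latent_dim, hidden)
        self.dec2 = nn.Linear(hidden, n_snps)

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        # Sample z ~ N(mu, sigma^2) via the reparameterization trick
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon = torch.sigmoid(self.dec2(F.relu(self.dec1(z))))
        return recon, mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to a standard-normal prior
    bce = F.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld
```

With a two-dimensional latent space, the encoder means can be plotted directly, which is what makes VAEs attractive for visualizing population structure.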

Cited by 7 publications (3 citation statements). References 54 publications.

“…Some early efforts used machine learning to account for issues that arise with high-dimensional summary statistics [5–7]. More recently, machine learning approaches have used various forms of convolutional, recurrent, and “deep” neural networks to improve inference and visualization [8–14]. One of the goals of moving to these approaches was to enable inference frameworks to operate on the “raw” data (genotype matrices), which avoids the loss of information that comes from reducing genotypes to summary statistics.…”
Section: Introduction
Mentioning confidence: 99%
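
The “raw data” goal above (networks that consume genotype matrices directly rather than summary statistics) might look like the following sketch: a small convolutional network over an individuals-by-SNPs matrix. The architecture, input convention, and class count are assumptions for illustration, not any cited study's model.

```python
# Illustrative CNN over a raw genotype matrix (hypothetical architecture).
import torch
import torch.nn as nn

class GenotypeCNN(nn.Module):
    def __init__(self, n_individuals, n_classes=3):
        super().__init__()
        # Treat individuals as channels and convolve along the SNP axis;
        # adaptive pooling makes the network agnostic to the number of SNPs.
        self.conv = nn.Conv1d(n_individuals, 32, kernel_size=5)
        self.pool = nn.AdaptiveMaxPool1d(1)  # pool over SNP positions
        self.fc = nn.Linear(32, n_classes)

    def forward(self, genotypes):  # shape: (batch, n_individuals, n_snps)
        h = torch.relu(self.conv(genotypes))
        return self.fc(self.pool(h).squeeze(-1))
```

Because the raw matrix is never collapsed into hand-crafted statistics, the network itself decides which features of the data are informative.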
“…We see this as an inherent problem relating to data structure. Previous comparisons of t-SNE found low fidelity with global data patterns, and latent space distances were poor proxies for ‘true’ among-group distances, particularly when compared to VAE (Becht et al. 2019; Battey et al. 2020). This potentially explains our observed ‘plateau’ of mean optimal K and SD in the t-SNE perplexity grid-search, in that perplexity defines the relative weighting of local versus global components (Wattenberg et al. 2016).…”
Section: Discussion
Mentioning confidence: 89%
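
The perplexity grid-search described in this statement can be sketched with scikit-learn's TSNE. The parameter values and toy genotype matrix below are assumptions; the cited study's actual pipeline and its procedure for choosing the optimal K are not reproduced here.

```python
# Sketch of a t-SNE perplexity grid-search (illustrative values and data).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 1000)).astype(float)  # toy 0/1/2 calls

embeddings = {}
for perplexity in (5, 15, 30, 50, 100):
    # Perplexity sets the effective neighborhood size, i.e. the relative
    # weighting of local versus global structure (Wattenberg et al. 2016).
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(genotypes)
```

Because perplexity controls the local-versus-global trade-off, embeddings across the grid can look qualitatively different on the same data, which is why downstream summaries such as an inferred K are checked for stability across values.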
“…Various approaches have been developed to approximate what an autoencoder learns. Most commonly, this involves visualisation of the latent dimensions, revealing possible clusters or regions of interest [20][21][22]. While autoencoders are frequently being applied to DNA methylation data [10,11,16], little work has been conducted on interpreting individual latent features and exploring, for example, which CpGs share a relation through common latent features.…”
Section: Introduction
Mentioning confidence: 99%
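
As this statement notes, the most common interpretation step is simply to visualise the latent space. Assuming the hypothetical GenotypeVAE and genotype array from the earlier sketch, that reduces to embedding samples with the trained encoder and scatter-plotting the latent coordinates:

```python
# Plot samples in a trained VAE's 2-D latent space (illustrative).
import matplotlib.pyplot as plt
import torch

model.eval()  # `model` is a trained GenotypeVAE from the sketch above
with torch.no_grad():
    mu, _ = model.encode(torch.as_tensor(genotypes, dtype=torch.float32))

plt.scatter(mu[:, 0].numpy(), mu[:, 1].numpy(), s=8)
plt.xlabel("latent dimension 1")
plt.ylabel("latent dimension 2")
plt.title("Samples in the learned latent space")
plt.show()
```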