Motivation: Recent developments in technology have enabled researchers to collect multiple OMICS datasets for the same individuals. The conventional approach to understanding the relationships between the collected datasets and a complex trait of interest is to analyse each OMICS dataset separately from the rest, or to test for associations between the OMICS datasets. In this work we show that integrating multiple OMICS datasets, instead of analysing them separately, improves both our understanding of their inter-relationships and the predictive accuracy for the tested trait. Several approaches have been proposed for the integration of heterogeneous and high-dimensional (p ≫ n) data such as OMICS. The sparse variant of Canonical Correlation Analysis (CCA) is a promising one: it penalises the canonical variables to produce sparse latent variables while achieving maximal correlation between the datasets. Over the years, a number of approaches for implementing sparse CCA (sCCA) have been proposed; they differ in their objective functions and in the iterative algorithms used to obtain the sparse latent variables, and they make different assumptions about the original datasets.
Results: Through a comparative study we have explored the performance of the conventional sCCA proposed by Parkhomenko et al. (2009), the penalised matrix decomposition CCA proposed by Witten and Tibshirani (2009), and its extension proposed by Suo et al. (2017). The aforementioned methods were modified to allow for different penalty functions. Although sCCA is an unsupervised learning approach for understanding inter-relationships between datasets, we have recast the problem as a supervised learning one and investigated how the computed latent variables can be used for predicting complex traits. The approaches were also extended to allow for multiple (more than two) datasets, where the trait was included as one of the input datasets. Both strategies showed improvement over conventional predictive models that include one or multiple datasets.
Availability: https://github.com/theorod93/sCCA
Supplementary information: Supplementary data and material are available at Bioinformatics online.
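The core sCCA idea described above can be sketched compactly. The following is a minimal, illustrative alternating update in the spirit of the penalised matrix decomposition approach of Witten and Tibshirani (2009), not the implementations compared in the paper (those are in the linked repository). The soft-threshold levels `c_u` and `c_v` are hypothetical tuning parameters; in practice they are chosen by permutation testing or cross-validation, and the data matrices are assumed column-centred.

```python
import numpy as np

def soft_threshold(v, c):
    """Element-wise soft-thresholding: the proximal operator of the L1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - c, 0.0)

def sparse_cca(X, Y, c_u=0.1, c_v=0.1, n_iter=100):
    """One pair of sparse canonical vectors (u, v) maximising u' X' Y v,
    with unit-norm constraints and L1 penalties applied via soft-thresholding."""
    K = X.T @ Y                                   # p x q cross-covariance (columns assumed centred)
    v = np.random.default_rng(0).standard_normal(K.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = soft_threshold(K @ v, c_u)            # update left canonical vector
        u /= np.linalg.norm(u) + 1e-12
        v = soft_threshold(K.T @ u, c_v)          # update right canonical vector
        v /= np.linalg.norm(v) + 1e-12
    return u, v
```

The returned latent variables `X @ u` and `Y @ v` are the sparse canonical variates; in the supervised setting described above, such variates become the features of a downstream predictive model for the trait.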
Manifold learning approaches, such as Stochastic Neighbour Embedding (SNE), Locally Linear Embedding (LLE) and Isometric Feature Mapping (ISOMAP), have been proposed for performing non-linear dimensionality reduction. These methods aim to produce two- or three-dimensional latent embeddings in order to visualise the data in intelligible representations. This manuscript proposes extensions of Student's t-distributed SNE (t-SNE), LLE and ISOMAP that allow for dimensionality reduction and subsequent visualisation of multi-view data. Nowadays, it is very common to have multiple data-views on the same samples, where each data-view contains a set of features describing a different aspect of the samples. For example, in biomedical studies it is possible to generate multiple OMICS datasets for the same individuals, such as transcriptomics, genomics and epigenomics, enabling a better understanding of the relationships between the different biological processes. The visualisation performance of the proposed methods is illustrated through the analysis of real and simulated datasets. Data visualisations have often been utilised for identifying potential clusters in datasets. We show that by incorporating the low-dimensional embeddings obtained via the multi-view manifold learning approaches into the K-means algorithm, clusters of the samples are accurately identified. Our proposed multi-SNE method outperforms the corresponding multi-ISOMAP and multi-LLE methods. Interestingly, multi-SNE is found to have comparable performance with methods proposed in the literature for performing multi-view clustering.
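The multi-view idea above can be illustrated with a toy sketch: each data-view contributes its own high-dimensional similarity matrix, while a single shared low-dimensional embedding is updated with the average of the per-view t-SNE gradients. This is only a simplified illustration of the principle, not the authors' multi-SNE implementation; in particular, the fixed Gaussian bandwidth `sigma` is an assumption (the real method calibrates bandwidths via perplexity) and no momentum or early exaggeration is used.

```python
import numpy as np

def pairwise_p(X, sigma=1.0):
    """Gaussian joint similarities for one data-view (fixed bandwidth for simplicity)."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def multi_view_sne(views, dim=2, n_iter=300, lr=10.0, seed=0):
    """Single embedding Y shared across views; plain gradient descent on the
    average of the per-view t-SNE KL gradients."""
    Ps = [pairwise_p(V) for V in views]
    n = views[0].shape[0]
    Y = np.random.default_rng(seed).standard_normal((n, dim)) * 1e-2
    for _ in range(n_iter):
        D = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        Qnum = 1.0 / (1.0 + D)                    # Student-t kernel in the embedding
        np.fill_diagonal(Qnum, 0.0)
        Q = Qnum / Qnum.sum()
        grad = np.zeros_like(Y)
        for P in Ps:                              # average the t-SNE gradient over views
            W = (P - Q) * Qnum
            grad += 4.0 * ((np.diag(W.sum(1)) - W) @ Y)
        Y -= lr * grad / len(Ps)
    return Y
```

The resulting embedding `Y` can then be fed directly into K-means, as in the clustering evaluation described above.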
An increasing amount of multi-view data is being published by studies in several fields. This type of data corresponds to multiple data-views, each representing a different aspect of the same set of samples. We have recently proposed multi-SNE, an extension of t-SNE, that produces a single visualisation of multi-view data. The multi-SNE approach provides low-dimensional embeddings of the samples, updated iteratively through the different data-views. Here, we further extend multi-SNE to a semi-supervised approach, S-multi-SNE, which classifies unlabelled samples by regarding the labelling information as an extra data-view. We look deeper into the performance, limitations and strengths of multi-SNE and S-multi-SNE by applying the two methods to various multi-view datasets with different challenges. We show that by including the labelling information, the projection of the samples improves drastically and is accompanied by strong classification performance.
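The two ingredients of the semi-supervised extension described above can be sketched as follows: building the extra "label view" (one-hot rows for labelled samples, zero rows for unlabelled ones), and classifying the unlabelled samples from their neighbours in the resulting embedding. This is a hedged sketch, not the S-multi-SNE implementation: the embedding itself is assumed given, the `-1` encoding for unlabelled samples is a convention chosen here, and k-nearest-neighbour classification stands in for whichever classifier the method actually uses.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def label_view(labels, n_classes):
    """Extra data-view built from labels: one-hot rows for labelled samples,
    all-zero rows for unlabelled ones (encoded here as -1)."""
    L = np.zeros((len(labels), n_classes))
    for i, y in enumerate(labels):
        if y >= 0:
            L[i, y] = 1.0
    return L

def classify_unlabelled(embedding, labels, k=5):
    """Assign each unlabelled sample the majority label of its k nearest
    labelled neighbours in the low-dimensional embedding."""
    labels = np.asarray(labels)
    mask = labels >= 0
    knn = KNeighborsClassifier(n_neighbors=k).fit(embedding[mask], labels[mask])
    out = labels.copy()
    out[~mask] = knn.predict(embedding[~mask])
    return out
```

In this sketch, `label_view` would simply be appended to the list of data-views before running the multi-view embedding, so that labelled samples of the same class are pulled together.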