Proceedings of the 25th International Conference on World Wide Web 2016
DOI: 10.1145/2872427.2883041

Visualizing Large-scale and High-dimensional Data

Abstract: We study the problem of visualizing large-scale and high-dimensional data in a low-dimensional (typically 2D or 3D) space. Much success has been reported recently by techniques that first compute a similarity structure of the data points and then project them into a low-dimensional space with the structure preserved. These two steps suffer from considerable computational costs, preventing the state-of-the-art methods such as the t-SNE from scaling to large-scale and high-dimensional data (e.g., millions of data p…
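The two-step recipe the abstract refers to (build a similarity structure such as a k-NN graph, then optimize a low-dimensional layout that preserves it) can be illustrated with the minimal Python sketch below. It is not the authors' LargeVis implementation: the dataset, neighbor count, learning rate, and negative-sampling count are assumptions made for illustration.

    # Sketch of the two-step approach: (1) build a k-NN similarity graph,
    # (2) learn a 2-D layout that keeps graph neighbors close.
    # Illustrative only -- not the authors' LargeVis implementation;
    # k, the learning rate, and the negative-sample count are assumed values.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = load_digits().data                       # (n_samples, n_features)
    n, k = X.shape[0], 15

    # Step 1: similarity structure as a k-NN edge list.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    edges = np.array([(i, j) for i in range(n) for j in idx[i, 1:]])

    # Step 2: SGD layout with a heavy-tailed kernel and negative sampling:
    # pull linked points together, push randomly sampled pairs apart.
    Y = rng.normal(scale=1e-2, size=(n, 2))
    lr, n_neg = 0.1, 5
    for epoch in range(20):                      # unoptimized pure Python, keep it small
        rng.shuffle(edges)
        for i, j in edges:
            d = Y[i] - Y[j]
            w = 1.0 / (1.0 + d @ d)              # Student-t style kernel
            Y[i] -= lr * 2.0 * w * d             # attract graph neighbors
            Y[j] += lr * 2.0 * w * d
            for m in rng.integers(0, n, n_neg):  # repel sampled non-neighbors
                if m == i:
                    continue
                d = Y[i] - Y[m]
                w = 1.0 / (1.0 + d @ d)
                Y[i] += lr * 2.0 * w * d / (1e-3 + d @ d)
    # Y now holds a 2-D embedding that roughly preserves the k-NN structure.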

Cited by 318 publications (271 citation statements). References 21 publications (37 reference statements).
“…Our approach achieves significantly better scalability than t-SNE, which is on the verge of being impractical for datasets with more than a million cells. Although a number of recent studies introduced new techniques for improving the scalability of data visualization tools (Dzwinel and Wcisło, 2015; Tang et al, 2016), they do not address the lack of generalizability that net-SNE overcomes.…”
Section: Discussion (mentioning; confidence: 99%)
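The generalizability contrasted in this statement is the ability to place previously unseen cells into an existing embedding without recomputing it. A rough sketch of that idea, using an sklearn regressor as a stand-in rather than the actual net-SNE model; the dataset, architecture, and settings are assumptions:

    # Sketch of the "generalizability" idea: fit a parametric map from the
    # feature space to an existing 2-D embedding, then place unseen points
    # with it. A stand-in for the idea behind net-SNE, not its actual model;
    # the dataset, MLP architecture, and settings are assumptions.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    X = load_digits().data
    X_ref, X_new = train_test_split(X, test_size=0.2, random_state=0)

    # Embed the reference set once (the expensive, non-generalizing step).
    Y_ref = TSNE(n_components=2, random_state=0).fit_transform(X_ref)

    # Learn features -> coordinates, then reuse the map for new points.
    mapper = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=1000,
                          random_state=0).fit(X_ref, Y_ref)
    Y_new = mapper.predict(X_new)                # (n_new, 2), no re-embedding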
“…Thus, we compared the results between using M versus using M in meaningful downstream visualization applications. Specifically, following previous studies, 17,18 we used t-distributed stochastic neighbor embedding (t-SNE), 19 an algorithm that can efficiently model high-dimensional objects as two-dimensional points, which makes it especially well-suited for visualizing our dataset. We generated our visualization by running t-SNE with default settings on the patient profile matrix M for the baseline and M for VisAGE.…”
Section: Discussion (mentioning; confidence: 99%)
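The workflow quoted here amounts to running t-SNE with default settings on a patient-by-feature matrix. A minimal sketch, with a random placeholder in place of the study's matrix M:

    # Sketch of the quoted workflow: t-SNE with default settings on a
    # patient-profile matrix M. The matrix below is a random placeholder,
    # since the actual M from the study is not available here.
    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    M = rng.normal(size=(500, 200))                  # patients x features (placeholder)
    coords = TSNE(n_components=2).fit_transform(M)   # (500, 2) points to plot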
“…However, it builds a k-NN network directly from the data, and then reduces the network to two dimensions without using external information. 18 Another study built upon LargeVis to visualize single cells, but still also directly computed embeddings from a k-NN network without utilizing external data. 17 Marlin et al visualized a pattern discovery model's clustering parameters in the context of EMR analysis.…”
Section: Related Work (mentioning; confidence: 99%)
“…To speed up the t-SNE analysis, one could use a multicore version that is available via the Rtsne.multicore package. Alternative algorithms, such as (Tang et al., 2016) (available via the largeVis package), can be used for dimensionality reduction of very large datasets without downsampling. Alternatively, the dimensionality reduction can be performed on the codes of the SOM, at a resolution specified by the user (see Figure 12).…”
Section: Cell Population Identification With FlowSOM and ConsensusClu… (mentioning; confidence: 99%)
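The last alternative in this quote, reducing the SOM codes rather than all cells, amounts to embedding a small set of prototype vectors. A Python sketch under that assumption, with k-means centroids standing in for the SOM codes (the quoted packages are R tools, so sizes and parameters here are illustrative):

    # Sketch of the last option in the quote: run the dimensionality reduction
    # on a small set of prototype ("code") vectors instead of every cell.
    # k-means centroids stand in for FlowSOM's SOM codes, since the quoted
    # tools are R packages; sizes and parameters are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    cells = rng.normal(size=(200_000, 30))       # placeholder expression matrix

    # Summarize the cells with a user-chosen number of prototypes
    # (e.g., the equivalent of a 10x10 SOM grid).
    codes = MiniBatchKMeans(n_clusters=100, random_state=0).fit(cells).cluster_centers_

    # Embedding 100 prototypes is cheap; the full 200k cells would need
    # downsampling or a faster algorithm such as LargeVis.
    coords = TSNE(n_components=2, perplexity=30).fit_transform(codes)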