2020
DOI: 10.1109/tkde.2019.2909204

Approximate Nearest Neighbor Search on High Dimensional Data — Experiments, Analyses, and Improvement

Abstract: Approximate nearest neighbor search (ANNS) is a fundamental and essential operation in applications from many domains, such as databases, machine learning, multimedia, and computer vision. Although new algorithms are continuously proposed in these domains each year, there is no comprehensive evaluation and analysis of their performance. In this paper, we conduct a comprehensive experimental evaluation of many state-of-the-art methods for approximate nearest neighbor search. Our study…

Cited by 243 publications (194 citation statements)
References: 69 publications
“…To address these challenges, Bioconductor has developed software packages that incorporate recent advances in nearest-neighbor and clustering algorithms that improve computational efficiency through approaches such as using approximate methods instead of exact methods, thereby trading an acceptable amount of accuracy for vastly improved runtimes. For example, the BiocNeighbors package [97][98][99][100] can be used to search for nearest neighbors, and then a shared nearest neighbor graph using cells as nodes can be built with the scran package [59]. Further, approximate methods have the advantage of smoothing over noise and sparsity, and thus potentially providing a better fit to the data [101].…”
Section: Clustering
confidence: 99%
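The shared nearest neighbor (SNN) construction this statement describes can be sketched in a few lines. The following Python fragment is a conceptual illustration only, not the Bioconductor API: all function names are hypothetical, and a brute-force k-NN stands in for the approximate search that BiocNeighbors would perform.

```python
import numpy as np

def knn_indices(X, k):
    # Brute-force k-NN for illustration; an approximate index
    # (e.g. Annoy or HNSW) would replace this step in practice.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def snn_graph(neighbors):
    # Link each cell to its k nearest neighbors, weighting every edge
    # by how many nearest neighbors the two cells have in common, so
    # clusters can then be found by community detection on the graph.
    sets = [set(row) for row in neighbors]
    edges = {}
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            a, b = min(i, int(j)), max(i, int(j))
            edges[(a, b)] = len(sets[a] & sets[b])
    return edges

# Usage (hypothetical data): edges = snn_graph(knn_indices(expr_matrix, k=10))
```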
“…BiocNeighbors [97][98][99][100]: Exact and approximate methods for nearest neighbor detection that use the BiocParallel [91] framework to parallelize operations.
SC3 [102], clusterExperiment [103], SIMLR [104], mbkmeans [106], BEARscc [107], clustree [108]: Unsupervised clustering frameworks for single-cell data.
edgeR [3,62], DESeq2 [7], limma [115]: Methods developed for bulk RNA-seq differential expression that can be used in combination with methods such as zinbwave [31,116] to account for zero-inflation.
MAST [28], scDD [117], BASiCS [64,65], SCDE [118]: Methods to identify differentially expressed features using statistical models that directly model zero-inflation.
slingshot [126], TSCAN [30], monocle [123,124,127], cellTree [128]: Methods for trajectory analysis or pseudotime inference.
MAST [28], AUCell [141], scmap [77], PADOG [139], fgsea [137], goseq [138], slalom [142], scCoGAPS [143,144], EnrichmentBrowser [140]: Methods for gene set / signature enrichment analysis.
iSEE [148]: Interactive data exploration and visualization.
countsimQC [153], batchQC…”
Section: Downstream Statistical Analyses
confidence: 99%
“…An approximate neighborhood graph can be constructed substantially more efficiently [22,11]. To improve performance, one can use various graph pruning methods [20,23,13]: In particular, it is not useful to keep neighbors that are close to each other [20,13].…”
Section: Retrieval Algorithms
confidence: 99%
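The pruning rule this quote alludes to can be sketched concretely, assuming it is the occlusion-style heuristic used in graph indexes such as HNSW [20,13]: a candidate is kept only if it is closer to the base point than to any neighbor already kept, so neighbors that are close to each other are dropped. The function and its signature are illustrative, not the cited authors' code.

```python
def prune_neighbors(dist, base, candidates, max_degree):
    # Scan candidates from nearest to farthest; keep a candidate only
    # if every already-kept neighbor is farther from it than the base
    # point is. Close-together neighbors are redundant, so at most one
    # of any tight pair survives.
    kept = []
    for c in sorted(candidates, key=lambda c: dist(base, c)):
        if all(dist(c, n) > dist(base, c) for n in kept):
            kept.append(c)
        if len(kept) == max_degree:
            break
    return kept
```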
“…For a recent experimental comparison of several retrieval approaches, see [32]. Although HNSW is possibly the best retrieval method for generic distances [23,20], in our work we use a modified variant of SW-graph, where retrieval starts from a single point (which is considerably more efficient than using multiple starting points). The main advantage of HNSW over the older version of SW-graph comes from (1) the introduction of pruning heuristics and (2) the use of a single starting point during retrieval.…”
Section: Retrieval Algorithms
confidence: 99%
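Single-entry-point retrieval on an SW-graph amounts to a best-first traversal of the neighborhood graph. A minimal sketch follows, assuming `graph` maps a node id to its neighbor ids; `dist`, `ef`, and the function name are assumptions for illustration, not the authors' implementation.

```python
import heapq

def greedy_search(graph, dist, query, start, ef):
    # Best-first traversal from a single entry point: repeatedly expand
    # the closest unvisited node and stop once no remaining candidate
    # can improve the current top-ef result set.
    visited = {start}
    candidates = [(dist(query, start), start)]   # min-heap by distance
    results = [(-dist(query, start), start)]     # max-heap of the best ef
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # nearest candidate is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(query, nb)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-d, n) for d, n in results)
```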
“…We use the benchmarking system described in [4] as the starting point for our study. Different approaches to benchmarking nearest neighbor search are described in [9,10,20]. We refer to [4] for a detailed comparison between the frameworks.…”
Section: Introduction
confidence: 99%
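Benchmarking frameworks of this kind typically score an approximate method by its recall against exact ground-truth neighbors. A small sketch of that measure (names are illustrative; this is not the code of [4]):

```python
def recall_at_k(approx_ids, exact_ids, k):
    # Average fraction of the true k nearest neighbors that the
    # approximate method returned, taken over all queries.
    hits = sum(len(set(a[:k]) & set(e[:k]))
               for a, e in zip(approx_ids, exact_ids))
    return hits / (k * len(exact_ids))
```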