Data-dependent metrics are powerful tools for learning the underlying structure of high-dimensional data. This article develops and analyzes a data-dependent metric known as diffusion state distance (DSD), which compares points using a data-driven diffusion process. Unlike related diffusion methods, DSDs incorporate information across time scales, which allows the intrinsic data structure to be inferred in a parameter-free manner. This article develops a theory for DSD based on the multitemporal emergence of mesoscopic equilibria in the underlying diffusion process. New algorithms for denoising and dimension reduction with DSD are also proposed and analyzed. These approaches are based on a weighted spectral decomposition of the underlying diffusion process, and experiments on synthetic datasets and real biological networks illustrate the efficacy of the proposed algorithms in terms of both speed and accuracy. Throughout, comparisons with related methods are made, in order to illustrate the distinct advantages of DSD for datasets exhibiting multiscale structure.
Introduction.

Metrics for pairwise comparisons of data points $X = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^D$ are an essential tool in a wide range of data analysis tasks including classification, regression, clustering, and visualization [24]. Euclidean distances, and more generally data-independent metrics that depend only on the original coordinates of the data, may be inadequate for high-dimensional data due to the curse of dimensionality [60], or for low-dimensional data with nonlinear correlations. In order to address these concerns, metrics derived from the global structure of the data are necessary.

A family of data-dependent metrics based on local similarity graphs has been developed to address this problem [29, 58, 53, 3, 4, 21, 17, 16]. These methods typically construct an undirected, weighted graph $G$ with nodes corresponding to $X$ and edge weights between nodes $x_i, x_j$ given by $W_{ij} = K(x_i, x_j)$ for a suitable kernel function $K$ that is usually radial and rapidly decaying. Though constituted from local relationships among the points of $X$ (due to the rapid decay of $K$), the global features of $X$ may be gleaned from $G$ by considering partial differential operators on $G$ (e.g., Laplacian or Schrödinger operators), geodesics in the path space, or diffusion processes on $X$. These new metrics may vastly improve over data-independent metrics for a range of tasks including supervised classification and regression, unsupervised clustering, low-dimensional embeddings, and the visualization of high-dimensional data.

While powerful, these methods may not adequately capture multiscale structure in the data. For example, when the underlying data contains both small and large clusters, or geometric features at multiple scales, it may be challenging to select the optimal low-dimensional representation or parameters for these data-dependent metrics without supervision. In particular, when considering diffusion processes on graphs, the time scale crucially determines the granularity of the subsequent...
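As a concrete illustration of the construction described above, the following is a minimal sketch (not the authors' implementation) of building a weighted graph from data via a radial, rapidly decaying kernel and row-normalizing it into the transition matrix of a diffusion process. The Gaussian kernel and the bandwidth parameter `sigma` are illustrative choices, not prescribed by the text.

```python
import numpy as np

def diffusion_matrix(X, sigma=1.0):
    """Build a Gaussian-kernel affinity graph W_ij = K(x_i, x_j) on the
    rows of X and return the random-walk transition matrix P = D^{-1} W."""
    # Pairwise squared Euclidean distances via the expansion
    # |x_i - x_j|^2 = |x_i|^2 + |x_j|^2 - 2 <x_i, x_j>.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # Radial, rapidly decaying kernel: weights between distant points
    # are negligible, so the graph encodes local relationships.
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    # Row-normalize so each row is a probability distribution:
    # P_ij is the probability of a random walker stepping i -> j.
    P = W / W.sum(axis=1, keepdims=True)
    return P

# Two well-separated clusters: a diffusion started in one cluster
# stays there with overwhelming probability at short time scales.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
P = diffusion_matrix(X, sigma=0.5)
print(np.allclose(P.sum(axis=1), 1.0))  # rows sum to 1
print(P[0, :5].sum() > 0.99)            # mass stays in the first cluster
```

Running the diffusion for $t$ steps corresponds to taking the matrix power $P^t$; the role of the time scale $t$ in determining the granularity of the resulting geometry is exactly the multiscale issue raised above.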