DADApy: Distance-based analysis of data-manifolds in Python

Glielmo, Aldo; Macocco, Iuri; Doimo, Diego; Carli, Matteo; Zeni, Claudio; Wild, Romina; D’Errico, M.; Rodríguez, Álex; Laio, Alessandro

doi:10.1016/j.patter.2022.100589

Cited by 14 publications

(7 citation statements)

References 43 publications

“…We first employed standard information imbalance [15] to find pairwise informative relationships between features. By this approach we were able to identify features with plain correlation, such as hematocrit (EHCT) and hemoglobin (EHB), as well as asymmetric correlations in which one feature holds more information about the other than vice versa.…”

Section: Discussion (mentioning)

confidence: 99%

High dimensional feature selection using informative distance measures: Application to COVID-19 severity prediction

Wild, Sozio, Margiotta, et al. 2022

Preprint

Clinical data bases typically include, for each patient, many heterogeneous features, for example blood exams, the clinical history before the onset of the disease, the evolution of the symptoms, the results of imaging exams, and many others. Using subsets of these features, one can measure the similarity between two patients in several different manners. We here propose to exploit a recently developed statistical approach, the information imbalance, to compare these different similarity measures, and quantify their relative information content. We apply this approach to a data set of ~ 1,300 COVID-19 patients in Udine hospital before October 2021. Using this approach we find (asymmetric) relationships between single features and systematically compare subsets of up to 20 different features as COVID-19 severity predictors. The identified features can be measured at the moment of the admission of the patient and, if used in combination, are maximally informative of the clinical fate and of the severity of the disease. The approach can be used also if the features are available only for a fraction of the patients and, importantly, is able to select automatically features with small inter-feature correlation.

“…We computed the information imbalance ∆ between each pair of features using the implementation in the Python package DADApy [15]. ∆(A → B) is close to zero if feature A predicts feature B well.…”

Section: Information Imbalance Between Input Features (mentioning)

confidence: 99%
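The statement above says that ∆(A → B) approaches zero when feature A predicts feature B well. A minimal NumPy sketch of that idea follows, assuming the usual rank-based definition: ∆(A → B) is 2/N times the average rank, measured in space B, of each point's nearest neighbour in space A. The function names `distance_ranks` and `information_imbalance` are hypothetical and written for illustration; this is not DADApy's API, and the brute-force distance matrix only suits small datasets.

```python
import numpy as np


def distance_ranks(X):
    """Return ranks[i, j]: rank of point j among the neighbours of i (1 = nearest)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a single feature as an (N, 1) array
    # Brute-force Euclidean distance matrix, O(N^2) memory.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbour
    order = np.argsort(dist, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(len(X))[:, None]
    ranks[rows, order] = np.arange(1, len(X) + 1)
    return ranks


def information_imbalance(A, B):
    """Sketch of Delta(A -> B): small when neighbours in A are also neighbours in B."""
    ranks_a = distance_ranks(A)
    ranks_b = distance_ranks(B)
    n = len(ranks_a)
    nn_in_a = np.argmin(ranks_a, axis=1)  # index of each point's nearest neighbour in A
    # Average rank in B of the nearest neighbour found in A, scaled by 2/N.
    return 2.0 / n * ranks_b[np.arange(n), nn_in_a].mean()


# Toy asymmetric pair (an assumption for illustration): y is a function of x,
# but x cannot be recovered from y because the sign is lost.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=400)
y = x ** 2
print("Delta(x -> y):", information_imbalance(x, y))
print("Delta(y -> x):", information_imbalance(y, x))
```

In this toy case the first value should be much smaller than the second, which is the kind of asymmetric relationship the citing paper describes between clinical features.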

“…The nonlinear approaches are geometrical methods based on nearest neighbor (NN) statistics (Facco et al, 2017; Denti et al, 2022), implemented in Glielmo et al (2022). These methods are meant to be applied in a range of scales: at each scale, determined by the neighbors’ rank and/or by the number of data points that enter the calculation, they return the estimated number of “soft” directions, that is, the number of directions in which the features of the dataset change remarkably, as opposed to “noise” directions characterized by small variations.…”

Section: Analyses Of the Relationships Between Measures And Between T... (mentioning)

confidence: 99%
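The nearest-neighbour approach cited above (Facco et al, 2017) can be illustrated with a short sketch of the TWO-NN idea: for each point, take the ratio μ = r2/r1 of the distances to its second and first nearest neighbours, then estimate the intrinsic dimension from the distribution of μ. The version below uses the maximum-likelihood form, ID ≈ N / Σ log μ, which is one common variant and not necessarily the fitting procedure used in the cited works or in DADApy. The function name `two_nn_id` is hypothetical, and the code assumes no duplicate points (which would give r1 = 0).

```python
import numpy as np


def two_nn_id(X):
    """Maximum-likelihood TWO-NN sketch: ID ~ N / sum(log(r2 / r1))."""
    X = np.asarray(X, dtype=float)
    # Brute-force Euclidean distance matrix, suitable for small datasets only.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbour list
    two_nearest = np.sort(dist, axis=1)[:, :2]
    r1, r2 = two_nearest[:, 0], two_nearest[:, 1]
    mu = r2 / r1  # assumes r1 > 0, i.e. no duplicate points
    return len(X) / np.log(mu).sum()


# Toy data (an assumption for illustration): a 2D plane linearly embedded in 5D,
# so only two "soft" directions exist.
rng = np.random.default_rng(0)
plane = rng.uniform(size=(800, 2)) @ rng.normal(size=(2, 5))
print("estimated ID:", two_nn_id(plane))
```

Because only the first two neighbours enter the estimate, it probes the smallest scale in the data; repeating it on subsamples of the points is one way to explore the range of scales the statement mentions.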

Synchrony, oscillations, and phase relationships in collective neuronal activity: a highly comparative overview of methods

Baroni, Fulcher 2024

Preprint

Neuronal activity is organized in collective patterns that are critical for information coding, generation, and communication between brain areas. These patterns are often described in terms of synchrony, oscillations, and phase relationships. Many methods have been proposed for the quantification of these collective states of dynamic neuronal organization. However, it is difficult to determine which method is best suited for which experimental setting and research question. This choice is further complicated by the fact that most methods are sensitive to a combination of synchrony, oscillations, and other factors; in addition, some of them display systematic biases that can complicate their interpretation. To address these challenges, we adopt a highly comparative approach, whereby spike trains are represented by a diverse library of measures. This enables unsupervised or supervised classification in the space of measures, or in that of spike trains. We compile a battery of 122 measures of synchrony, oscillations, and phase relationships, complemented with 9 measures of spiking intensity and variability. We first apply them to sets of synthetic spike trains with known statistical properties, and show that all measures are confounded by extraneous factors such as firing rate or population frequency, but to different extents. Then, we analyze spike trains recorded in different species (rat, mouse, and monkey) and brain areas (primary sensory cortices and hippocampus) and show that our highly comparative approach provides a high-dimensional quantification of collective network activity that can be leveraged for both unsupervised and supervised classification of firing patterns. Overall, the highly comparative approach provides a detailed description of the empirical properties of multineuron spike train analysis methods, including practical guidelines for their use in experimental settings, and advances our understanding of neuronal coordination and coding.

“…However, one can only find it on GitHub repositories, coded in Python and C++. Note that Python versions of the TWO-NN estimator have also been implemented in the recent scikit-dimension and DADApy packages (Bac et al 2021; Glielmo et al 2022). Moreover, DADApy contains routines dedicated to GRIDE.…”

Section: B Intrinsic and Other Packages (mentioning)

confidence: 99%

“…Among the various options, the package Rdimtools (You 2022a) stands out, implementing 150 different algorithms, 17 of which are exclusively dedicated to ID estimation (You 2022b). Finally, it is worth mentioning that there are also Python (Van Rossum et al 2011) packages implementing different methods for ID estimation: two prominent examples are scikit-dimension (Bac et al 2021) and DADApy (Glielmo et al 2022). See Appendix B for more details.…”

Section: Introduction (mentioning)

confidence: 99%

intRinsic: An R Package for Model-Based Estimation of the Intrinsic Dimension of a Dataset

Denti 2023

J. Stat. Soft.

This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset, an essential quantity for most dimensionality reduction techniques. In order to make these novel estimators easily accessible, the package contains a small number of high-level functions that rely on a broader set of efficient, low-level routines. Generally speaking, intRinsic encompasses models that fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the two nearest neighbors estimator, a method derived from the distributional properties of the ratios of the distances between each data point and its first two closest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. In the second category, we find the heterogeneous intrinsic dimension algorithm, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets. This way, we can facilitate the exposition by immediately assessing the validity of the results. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. Finally, we show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights into the topological structure of a dataset.
