Data segmentation based on the local intrinsic dimension

Allegra, Michele; Facco, Elena; Denti, Francesco; Laio, Alessandro; Mira, Antonietta

doi:10.1038/s41598-020-72222-0

Cited by 23 publications (30 citation statements)

References 38 publications (75 reference statements)


“…Further assessment of the statistical validity of taking separated centroids as representative of ecological clustering in a PCA setting requires additional work (possibly using recent and promising methods based on the evaluation of the local intrinsic dimension of the data (Allegra et al. 2020)). Fig.…”

Section: Results (mentioning)

confidence: 99%

Codon usage bias and environmental adaptation in microbial organisms

2021


In each genome, synonymous codons are used with different frequencies; this general phenomenon is known as codon usage bias. It has been previously recognised that codon usage bias could affect the cellular fitness and might be associated with the ecology of microbial organisms. In this exploratory study, we investigated the relationship between codon usage bias, lifestyles (thermophiles vs. mesophiles; pathogenic vs. non-pathogenic; halophilic vs. non-halophilic; aerobic vs. anaerobic and facultative) and habitats (aquatic, terrestrial, host-associated, specialised, multiple) of 615 microbial organisms (544 bacteria and 71 archaea). Principal component analysis revealed that species with given phenotypic traits and living in similar environmental conditions have similar codon preferences, as represented by the relative synonymous codon usage (RSCU) index, and similar spectra of tRNA availability, as gauged by the tRNA gene copy number (tGCN). Moreover, by measuring the average tRNA adaptation index (tAI) for each genome, an index that can be associated with translational efficiency, we observed that organisms able to live in multiple habitats, including facultative organisms, mesophiles and pathogenic bacteria, are characterised by a reduced translational efficiency, consistently with their need to adapt to different environments. Our results show that synonymous codon choices might be under strong translational selection, which modulates the choice of the codons to differently match tRNA availability, depending on the organism’s lifestyle needs. To our knowledge, this is the first large-scale study that examines the role of codon bias and translational efficiency in the adaptation of microbial organisms to the environment in which they live.


“…However these results were derived for the simplest uniform euclidean manifold with single global intrinsic dimension, they form a base for application in more complex cases. For example the pdf of the local statistic make possible to apply the FSA estimator within mixture-based approaches, this would provide better ID estimates when the ID is varying in the data set (Haro, Randall & Sapiro, 2008; Allegra et al, 2020).…”

Section: Discussion (mentioning)

confidence: 99%

Manifold-adaptive dimension estimation revisited

Benkő, Stippinger, Rehus, et al. 2022

PeerJ Computer Science


Data dimensionality informs us about data complexity and sets limit on the structure of successful signal processing pipelines. In this work we revisit and improve the manifold adaptive Farahmand-Szepesvári-Audibert (FSA) dimension estimator, making it one of the best nearest neighbor-based dimension estimators available. We compute the probability density function of local FSA estimates, if the local manifold density is uniform. Based on the probability density function, we propose to use the median of local estimates as a basic global measure of intrinsic dimensionality, and we demonstrate the advantages of this asymptotically unbiased estimator over the previously proposed statistics: the mode and the mean. Additionally, from the probability density function, we derive the maximum likelihood formula for global intrinsic dimensionality, if i.i.d. holds. We tackle edge and finite-sample effects with an exponential correction formula, calibrated on hypercube datasets. We compare the performance of the corrected median-FSA estimator with kNN estimators: maximum likelihood (Levina-Bickel), the 2NN and two implementations of DANCo (R and MATLAB). We show that corrected median-FSA estimator beats the maximum likelihood estimator and it is on equal footing with DANCo for standard synthetic benchmarks according to mean percentage error and error rate metrics. With the median-FSA algorithm, we reveal diverse changes in the neural dynamics while resting state and during epileptic seizures. We identify brain areas with lower-dimensional dynamics that are possible causal sources and candidates for being seizure onset zones.
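The median-of-local-estimates idea in the abstract above can be illustrated with a minimal sketch. It assumes the usual form of the local FSA estimate, d(x) = ln 2 / ln(r_k(x) / r_{k/2}(x)), where r_j(x) is the distance from x to its j-th nearest neighbour, and takes the median over all points as the global estimate. The function names `local_fsa_estimates` and `median_fsa`, the choice k = 10, and the toy dataset are illustrative assumptions, not taken from the paper, and the exponential edge and finite-sample correction mentioned in the abstract is not included.

```python
import math
import random
import statistics


def local_fsa_estimates(points, k):
    """Return one local FSA dimension estimate per usable point.

    Assumed formula: d(x) = ln 2 / ln(r_k(x) / r_{k/2}(x)), with r_j(x) the
    distance from x to its j-th nearest neighbour. k must be even and >= 2.
    """
    if k < 2 or k % 2:
        raise ValueError("k must be an even integer >= 2")
    estimates = []
    for i, p in enumerate(points):
        # Brute-force sorted distances from p to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r_half, r_full = dists[k // 2 - 1], dists[k - 1]
        # Skip degenerate neighbourhoods (ties or duplicate points).
        if r_full > r_half > 0:
            estimates.append(math.log(2) / math.log(r_full / r_half))
    return estimates


def median_fsa(points, k=10):
    """Global estimate: the median of the local FSA estimates (uncorrected)."""
    return statistics.median(local_fsa_estimates(points, k))


# Toy data (an assumption for illustration): a 2-D uniform square
# embedded in 5-D by padding with zeros.
rng = random.Random(0)
plane = [(rng.random(), rng.random(), 0.0, 0.0, 0.0) for _ in range(400)]

estimate = median_fsa(plane, k=10)
print(f"median-FSA estimate: {estimate:.2f}")
```

The brute-force neighbour search is quadratic in the number of points and is only meant for small examples; a k-d tree or similar index would be the natural choice for real datasets.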


“…In 3D, this approach can be used for object detection (see Figure 1), but it can be generalized for higher-dimensional data point clouds. Interestingly, local ID can be related to various object characteristics in various domains: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets [74].…”

Section: Discussion (mentioning)

confidence: 99%

Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation

Bac, Mirkes, Gorban, et al. 2021

Entropy


Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for the purpose of estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note introduces scikit-dimension, an open-source Python package for intrinsic dimension estimation. The scikit-dimension package provides a uniform implementation of most of the known ID estimators based on the scikit-learn application programming interface to evaluate the global and local intrinsic dimension, as well as generators of synthetic toy and benchmark datasets widespread in the literature. The package is developed with tools assessing the code quality, coverage, unit testing and continuous integration. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation for real-life and synthetic data.

