Regularization of covariance matrices in high dimensions is usually either based on a known ordering of variables or ignores the ordering entirely. This paper proposes a method for discovering meaningful orderings of variables based on their correlations using the Isomap, a non-linear dimension reduction technique designed for manifold embeddings. These orderings are then used to construct a sparse covariance estimator, which is block-diagonal and/or banded. Finding an ordering to which banding can be applied is desirable because banded estimators have been shown to be consistent in high dimensions. We show that in situations where the variables do have such a structure, the Isomap does very well at discovering it, and the resulting regularized estimator performs better for covariance estimation than other regularization methods that ignore variable order, such as thresholding. We also propose a bootstrap approach to constructing the neighborhood graph used by the Isomap, and show it leads to better estimation. We illustrate our method on data on protein consumption, where the variables (food types) have a structure but it cannot be easily described a priori, and on a gene expression data set.
This paper addresses the problem of determining the number of pure chemical components in a mixture by applying the maximum likelihood estimator (MLE) of intrinsic dimension. The application here is to Raman spectroscopy data, although the method is general and can be applied to any type of data from a chemical mixture. We show that the MLE produces superior results compared to other methods on both simulated and real chemical mixtures, and is accurate even when minor components are present. Even if the signal-to-noise (SN) ratio is very low, accurate estimates can still be obtained by smoothing the data before applying the estimator, this approach is illustrated on two real datasets with high noise levels. Since the MLE is computed locally at every data point, we also show how the local estimates can be used for other applications, such as segmenting the specimen into homogeneous regions.
Insights into protein folding rely increasingly on the synergy between experimental and theoretical approaches. Developing successful computational models requires access to experimental data of sufficient quantity and high quality. We compiled folding rate constants for what initially appeared to be 184 proteins from 15 published collections/web databases. To generate the highest confidence in the dataset, we verified the reported lnk f value and exact experimental construct and conditions from the original experimental report(s). The resulting comprehensive database of 126 verified entries, ACPro, will serve as a freely accessible resource (https://www.ats. amherst.edu/protein/) for the protein folding community to enable confident testing of predictive models. In addition, we provide a streamlined submission form for researchers to add new folding kinetics results, requiring specification of all the relevant experimental information according to the standards proposed in 2005 by the protein folding consortium organized by Plaxco. As the number and diversity of proteins whose folding kinetics are studied expands, our curated database will enable efficient and confident incorporation of new experimental results into a standardized collection. This database will support a more robust symbiosis between experiment and theory, leading ultimately to more rapid and accurate insights into protein folding, stability, and dynamics.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.