We show how the massive data compression algorithm MOPED can be used to reduce, by orders of magnitude, the number of simulated datasets required to estimate the covariance matrix needed for the analysis of Gaussian-distributed data. This is relevant when the covariance matrix cannot be calculated directly. The compression is especially valuable when the covariance matrix varies with the model parameters. In this case, it may be prohibitively expensive to run enough simulations to estimate the full covariance matrix throughout the parameter space. This compression may be particularly valuable for the next generation of weak-lensing surveys, such as those proposed for Euclid and LSST, for which the number of summary data (such as band-power or shear-correlation estimates) is very large, ∼10⁴, owing to the large number of tomographic redshift bins into which the data will be divided. In the pessimistic case where the covariance matrix is estimated separately at every point in an MCMC analysis, this may require an unfeasible 10⁹ simulations. We show here that MOPED can reduce this number by a factor of 1000, or by a factor of ∼10⁶ if some regularity in the covariance matrix is assumed, reducing the number of simulations required to a manageable 10³ and making an otherwise intractable analysis feasible.
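The idea behind MOPED-style compression can be sketched for a single parameter: for Gaussian data with a parameter-dependent mean and fixed covariance, one weighting vector per parameter compresses the whole dataset to one number while preserving the Fisher information. The toy mean model and diagonal covariance below are invented for illustration and are not taken from the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gaussian data: N data points with a parameter-dependent mean mu(theta)
# (hypothetical model: mu_i = theta * i) and a fixed covariance C.
N = 50
i = np.arange(1, N + 1, dtype=float)

def mu(theta):
    return theta * i            # model mean; d mu / d theta = i

C = np.diag(0.5 + 0.01 * i)     # assumed diagonal covariance, for the sketch only
Cinv = np.linalg.inv(C)

# MOPED weighting vector for the single parameter theta:
#   b = C^{-1} mu_,theta / sqrt(mu_,theta^T C^{-1} mu_,theta)
dmu = i
b = Cinv @ dmu / np.sqrt(dmu @ Cinv @ dmu)

# Compress one simulated dataset x down to a single number y = b^T x.
theta_true = 1.3
x = mu(theta_true) + rng.multivariate_normal(np.zeros(N), C)
y = b @ x

# By construction the compressed datum has unit variance: b^T C b = 1,
# so far fewer simulations are needed to estimate its (co)variance.
print("compressed datum:", y)
print("Var(y) = b^T C b =", b @ C @ b)
```

With several parameters, the weighting vectors are built the same way and then orthogonalized, so the compressed dataset has one number per parameter instead of ∼10⁴ summaries.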
Explainability in machine learning is crucial for iterative model development, compliance with regulation, and providing operational nuance to model predictions. Shapley values provide a general framework for explainability by attributing a model's output prediction to its input features in a mathematically principled and model-agnostic way. However, practical implementations of the Shapley framework make an untenable assumption: that the model's input features are uncorrelated. In this work, we articulate the dangers of this assumption and introduce two solutions for computing Shapley explanations that respect the data manifold. One solution, based on generative modelling, provides flexible access to on-manifold data imputations, while the other directly learns the Shapley value function in a supervised way, providing performance and stability at the cost of flexibility. While the commonly used "off-manifold" Shapley values can (i) break symmetries in the data, (ii) give rise to misleading wrong-sign explanations, and (iii) lead to uninterpretable explanations in high-dimensional data, our approach to on-manifold explainability demonstrably overcomes each of these problems.
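The off-manifold failure mode described above can be seen in an exact Shapley computation on a toy model. The value function below is the common interventional ("off-manifold") choice, which breaks feature correlations by imputing from the marginal background; the model and data are invented for illustration:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)

# Toy data: features 0 and 1 are almost perfectly correlated, feature 2 is noise.
# The model reads only feature 0, so off-manifold Shapley assigns feature 1 zero
# credit, even though on the data manifold it carries the same information.
n, d = 1000, 3
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n), rng.normal(size=n)])

def f(X):
    return X[:, 0]              # model uses only the first feature

def value(S, x, background):
    """Off-manifold (interventional) value function: average f over the
    background distribution with the features in S clamped to x."""
    Z = background.copy()
    Z[:, list(S)] = x[list(S)]
    return f(Z).mean()

def shapley(x, background):
    d = background.shape[1]
    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for r in range(d):      # enumerate all coalitions not containing j
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
                phi[j] += w * (value(set(S) | {j}, x, background)
                               - value(set(S), x, background))
    return phi

x = X[0]
phi = shapley(x, X)
print("Shapley values:", phi)   # nearly all credit lands on feature 0
# Efficiency: the attributions sum to f(x) minus the average prediction.
print("efficiency check:", phi.sum(), f(x[None])[0] - f(X).mean())
```

An on-manifold alternative would replace `value` with an expectation over the conditional distribution of the unclamped features given the clamped ones, which is exactly what the generative and supervised approaches in the abstract provide.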
In studies of the interstellar medium in galaxies, radiative transfer models of molecular emission are useful for relating molecular line observations back to the physical conditions of the gas they trace. However, doing this requires solving a highly degenerate inverse problem. In order to alleviate these degeneracies, the abundances derived from astrochemical models can be converted into column densities and fed into radiative transfer models. This enforces that the molecular gas composition used by the radiative transfer models be chemically realistic. However, because of the complexity and long running time of astrochemical models, it can be difficult to incorporate chemical models into the radiative transfer framework. In this paper, we introduce a statistical emulator of the UCLCHEM astrochemical model, built using neural networks. We then illustrate, through examples of parameter estimations, how such an emulator can be applied to real and synthetic observations.
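The emulator idea can be sketched end to end: run the slow model once on a parameter grid, fit a small neural network to the results, and use the fast network inside the inference loop. Here `slow_model` is a cheap stand-in invented for illustration (the real UCLCHEM outputs are far more complex), and the network is a minimal one-hidden-layer regressor trained by full-batch gradient descent:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for an expensive astrochemical code: maps two physical
# parameters (think log density, temperature) to a single "abundance".
def slow_model(theta):
    return np.sin(theta[:, 0]) * np.exp(-0.5 * theta[:, 1])

# Training set: run the slow model once on a modest set of parameter samples.
theta = rng.uniform(-2, 2, size=(500, 2))
y = slow_model(theta)

# One-hidden-layer neural network emulator.
H = 32
W1 = rng.normal(scale=0.5, size=(2, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)

lr = 0.05
for step in range(3000):
    h = np.tanh(theta @ W1 + b1)                 # forward pass
    pred = (h @ W2 + b2).ravel()
    err = pred - y                               # mean-squared-error gradient
    gW2 = h.T @ err[:, None] / len(y); gb2 = np.array([err.mean()])
    dh = err[:, None] @ W2.T * (1 - h ** 2)      # backprop through tanh
    gW1 = theta.T @ dh / len(y); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# The trained emulator now replaces the slow model inside, e.g., an MCMC loop.
test = rng.uniform(-2, 2, size=(200, 2))
pred = (np.tanh(test @ W1 + b1) @ W2 + b2).ravel()
rmse = np.sqrt(np.mean((pred - slow_model(test)) ** 2))
print("emulator RMSE:", rmse)
```

The pay-off is that each emulator evaluation is a few matrix multiplies, so the cost of the chemical model is paid once at training time rather than at every likelihood call.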
Modern astronomical surveys are observing spectral data for millions of stars. These spectra contain chemical information that can be used to trace the Galaxy’s formation and chemical enrichment history. However, extracting the information from spectra and making precise and accurate chemical abundance measurements is challenging. Here we present a data-driven method for isolating the chemical factors of variation in stellar spectra from those of other parameters (i.e., Teff, log g, [Fe/H]). This enables us to build a spectral projection for each star with these parameters removed. We do this with no ab initio knowledge of elemental abundances themselves and hence bypass the uncertainties and systematics associated with modeling approaches that rely on synthetic stellar spectra. To remove known nonchemical factors of variation, we develop and implement a neural network architecture that learns a disentangled spectral representation. We simulate our recovery of chemically identical stars using the disentangled spectra in a synthetic APOGEE-like data set. We show that this recovery declines as a function of the signal-to-noise ratio but that our neural network architecture outperforms simpler modeling choices. Our work demonstrates the feasibility of data-driven abundance-free chemical tagging.
The importance of explainability in machine learning continues to grow, as both neural-network architectures and the data they model become increasingly complex. Unique challenges arise when a model's input features become high dimensional: on one hand, principled model-agnostic approaches to explainability become too computationally expensive; on the other, more efficient explainability algorithms lack natural interpretations for general users. In this work, we introduce a framework for human-interpretable explainability on high-dimensional data, consisting of two modules. First, we apply a semantically meaningful latent representation, both to reduce the raw dimensionality of the data, and to ensure its human interpretability. These latent features can be learnt, e.g. explicitly as disentangled representations or implicitly through image-to-image translation, or they can be based on any computable quantities the user chooses. Second, we adapt the Shapley paradigm for model-agnostic explainability to operate on these latent features. This leads to interpretable model explanations that are both theoretically controlled and computationally tractable. We benchmark our approach on synthetic data and demonstrate its effectiveness on several image-classification tasks.
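The two-module structure can be sketched with PCA standing in for the learned latent representation (the abstract's disentangled or image-to-image representations are assumed, not reproduced, here): encode the raw data into a few latent coordinates, view the model through the decoder, and compute exact Shapley values over the latent features instead of the raw ones. All data and the model score are invented for illustration:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(3)

# Toy high-dimensional raw data (64 features) generated from two latent factors.
n, d, k = 500, 64, 2
Zt = rng.normal(size=(n, k))
A = rng.normal(size=(k, d))
X = Zt @ A + 0.05 * rng.normal(size=(n, d))

def model(X):
    return X @ A[0] / d                  # toy model score on raw features

# Module 1: a low-dimensional latent representation (PCA as a stand-in).
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
V = Vt[:k]                               # top-k principal directions
encode = lambda X: (X - mean) @ V.T
decode = lambda Z: Z @ V + mean
g = lambda Z: model(decode(Z))           # the model viewed in latent space

# Module 2: exact Shapley values over the k latent features.
def value(Ssub, z, background):
    B = background.copy()
    B[:, list(Ssub)] = z[list(Ssub)]     # clamp latent features in the coalition
    return g(B).mean()

def shapley(z, background):
    kdim = background.shape[1]
    phi = np.zeros(kdim)
    for j in range(kdim):
        others = [m for m in range(kdim) if m != j]
        for r in range(kdim):
            for Ssub in itertools.combinations(others, r):
                w = (math.factorial(r) * math.factorial(kdim - r - 1)
                     / math.factorial(kdim))
                phi[j] += w * (value(set(Ssub) | {j}, z, background)
                               - value(set(Ssub), z, background))
    return phi

Zl = encode(X)
phi = shapley(Zl[0], Zl)
print("latent Shapley values:", phi)
# Efficiency holds in latent space: the k attributions sum to g(z) - E[g].
print(phi.sum(), g(Zl[:1])[0] - g(Zl).mean())
```

Because enumeration is exponential only in the number of latent features, working in a small, human-meaningful latent space keeps the exact Shapley computation tractable where it would be hopeless over thousands of raw features.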