We develop a spatial statistical methodology to design national air pollution monitoring networks with good predictive capabilities while minimizing the cost of monitoring. The underlying complexity of atmospheric processes and the urgent need to give credible assessments of environmental risk create problems that require new statistical methodologies. In this work, we present a new method of ranking candidate subnetworks that takes both the environmental cost and the statistical information into account. A Bayesian algorithm is introduced to obtain an optimal subnetwork within an entropy framework. The final network and the accuracy of the spatial predictions are heavily dependent on the underlying model of spatial correlation. Usually the simplifying assumption of stationarity, in the sense that the spatial dependence structure does not change with location, is made for spatial prediction. However, it is not uncommon to find spatial data that show strong signs of nonstationary behavior. We build upon an existing approach that creates a nonstationary covariance from a mixture of a family of stationary processes, and we propose a Bayesian method for estimating the associated parameters using reversible jump Markov chain Monte Carlo. We apply these methods for spatial prediction and network design to ambient ozone data from a monitoring network in the eastern US.
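As a rough illustration of the two ingredients described above (a minimal sketch, not the paper's algorithm; the exponential covariance, the component ranges, and the location-specific mixing weights are assumptions made only for this example):

    import numpy as np

    def stationary_exp_cov(coords, range_param, sill=1.0):
        # Stationary exponential covariance over 2-D site coordinates.
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        return sill * np.exp(-d / range_param)

    def mixture_nonstationary_cov(coords, ranges, weights):
        # Nonstationary covariance built as a location-weighted mixture of
        # stationary exponential components; weights[i, k] is the
        # (hypothetical) weight of component k at site i.
        n = coords.shape[0]
        cov = np.zeros((n, n))
        for k, r in enumerate(ranges):
            cov += np.outer(weights[:, k], weights[:, k]) * stationary_exp_cov(coords, r)
        return cov

    def greedy_max_entropy_sites(cov, k):
        # Greedily keep the k sites whose joint Gaussian entropy (log det of
        # the subnetwork covariance) is largest: at each step add the site
        # with the largest conditional variance given the sites already chosen.
        selected = []
        for _ in range(k):
            best, best_var = None, -np.inf
            for j in range(cov.shape[0]):
                if j in selected:
                    continue
                if selected:
                    S = cov[np.ix_(selected, selected)]
                    c = cov[np.ix_(selected, [j])]
                    cond_var = cov[j, j] - float(c.T @ np.linalg.solve(S, c))
                else:
                    cond_var = cov[j, j]
                if cond_var > best_var:
                    best, best_var = j, cond_var
            selected.append(best)
        return selected

A full Bayesian treatment would additionally place priors on the ranges and mixing weights and average the design criterion over the posterior, as the reversible jump MCMC approach in the paper does.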
Classical dependence measures such as the Pearson correlation, Spearman's ρ, and Kendall's τ can detect only monotonic or linear dependence. To overcome these limitations, Székely et al. (2007) proposed distance covariance as a weighted L2 distance between the joint characteristic function and the product of the marginal characteristic functions. The distance covariance is 0 if and only if the two random vectors X and Y are independent, so it can detect the presence of any dependence structure when the sample size is large enough. They further showed that the sample distance covariance can be calculated simply from modified Euclidean distances, which typically requires O(n²) cost. This quadratic computing time greatly limits the application of distance covariance to large data. In this paper, we present a simple exact O(n log n) algorithm to calculate the sample distance covariance between two univariate random variables. The proposed method essentially consists of two sorting steps, so it is easy to implement. Empirical results show that the proposed algorithm is significantly faster than state-of-the-art methods. The algorithm's speed will enable researchers to explore complicated dependence structures in large datasets.
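For reference, the double-centering formula of Székely et al. (2007) can be written directly; this is the naive O(n²) computation, not the O(n log n) sorting-based algorithm proposed in the paper:

    import numpy as np

    def sample_distance_covariance(x, y):
        # Naive O(n^2) sample distance covariance via double centering.
        x = np.asarray(x, dtype=float).ravel()
        y = np.asarray(y, dtype=float).ravel()
        a = np.abs(x[:, None] - x[None, :])   # pairwise distances within x
        b = np.abs(y[:, None] - y[None, :])   # pairwise distances within y
        A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
        B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
        # The mean of A * B is nonnegative in exact arithmetic; clip guards
        # against tiny negative values from floating-point round-off.
        return np.sqrt(max((A * B).mean(), 0.0))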
Support vector data description (SVDD) is a popular technique for detecting anomalies. The SVDD classifier partitions the whole space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier requires a kernel function, and the Gaussian kernel is a common choice. The Gaussian kernel has a bandwidth parameter whose value strongly affects the quality of the results. A small bandwidth leads to overfitting, and the resulting SVDD classifier overestimates the number of anomalies; a large bandwidth leads to underfitting, and the classifier fails to detect many anomalies. In this paper we present a new automatic, unsupervised method for selecting the Gaussian kernel bandwidth. The selected value can be computed quickly, and it is competitive with existing bandwidth selection methods.
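The effect of the bandwidth can be illustrated with scikit-learn's OneClassSVM, which with an RBF (Gaussian) kernel solves a problem equivalent to SVDD; the synthetic data, the candidate bandwidths, and the nu value below are illustrative assumptions, and the paper's selection method itself is not reproduced here:

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Illustrative 2-D training data: two Gaussian clusters.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

    # gamma = 1 / (2 * s**2) relates sklearn's gamma to a bandwidth s.
    for s in (0.1, 1.0, 10.0):
        clf = OneClassSVM(kernel="rbf", gamma=1.0 / (2.0 * s ** 2), nu=0.05).fit(X)
        flagged = np.mean(clf.predict(X) == -1)
        print(f"bandwidth s={s:5.1f}: fraction flagged as outliers = {flagged:.3f}")

Very small bandwidths tend to flag many training points (overfitting), while very large bandwidths produce a single loose sphere around all the data (underfitting).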
Support vector data description (SVDD) is a machine-learning technique used for single-class classification and outlier detection. The SVDD formulation with a kernel function provides a flexible boundary around the data, and the values of the kernel parameters affect the nature of that boundary. For example, with a Gaussian kernel, as the kernel bandwidth is lowered, the data boundary changes from spherical to wiggly. A spherical data boundary leads to underfitting, and an extremely wiggly data boundary leads to overfitting. In this paper, we propose an empirical criterion for obtaining good values of the Gaussian kernel bandwidth parameter. This criterion provides a smooth boundary that captures the essential geometric features of the data.
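The specific empirical criterion is not reproduced here; as a stand-in, the widely used median pairwise-distance heuristic shows where a data-driven bandwidth value would plug into the Gaussian kernel:

    import numpy as np
    from scipy.spatial.distance import pdist

    def median_heuristic_bandwidth(X):
        # Median of all pairwise Euclidean distances -- a common rule of
        # thumb, NOT the criterion proposed in this paper.
        return np.median(pdist(X))

    # The resulting value s enters the kernel as exp(-||x - y||^2 / (2 * s**2)).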
Support vector data description (SVDD) is a machine learning technique that is used for single-class classification and outlier detection. The idea of SVDD is to find a set of support vectors that defines a boundary around the data. When dealing with online or large data, existing batch SVDD methods have to be rerun in each iteration. We propose a fast incremental learning algorithm for SVDD (FISVDD) that uses the Gaussian kernel. The algorithm builds on the observation that all support vectors on the boundary have the same distance to the center of the sphere in the higher-dimensional feature space induced by the Gaussian kernel. Each iteration involves only the existing support vectors and the new data point. Moreover, the algorithm is based solely on matrix manipulations; the support vectors and their corresponding Lagrange multipliers αᵢ are automatically selected and determined in each iteration. The complexity of each iteration is only O(k²), where k is the number of support vectors. Experimental results on real data sets indicate that FISVDD achieves significant gains in efficiency with almost no loss in either outlier detection accuracy or objective function value.
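The scoring step implied by the equal-distance observation can be sketched as follows (a minimal sketch that assumes the support vectors sv and multipliers alpha have already been obtained, e.g. incrementally, and that the first row of sv is a boundary support vector):

    import numpy as np

    def gaussian_kernel(A, B, s):
        # Gaussian kernel matrix between row sets A and B with bandwidth s.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * s ** 2))

    def svdd_is_inlier(z, sv, alpha, s):
        # With the Gaussian kernel K(x, x) = 1, the squared distance from z
        # to the center is 1 - 2 * sum_i alpha_i K(x_i, z) + const, so
        # comparing distances reduces to comparing weighted kernel sums.
        z = np.atleast_2d(z)
        score_z = gaussian_kernel(z, sv, s) @ alpha      # sum_i alpha_i K(x_i, z)
        # A boundary support vector lies exactly on the sphere, so its
        # kernel sum defines the threshold.
        threshold = gaussian_kernel(sv[:1], sv, s) @ alpha
        return score_z >= threshold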