Bump-hunting, or mode identification, is a fundamental problem that arises in almost every scientific field of data-driven discovery. Surprisingly, very few data modeling tools are available for mode discovery that is automatic (not requiring manual case-by-case investigation), objective (not subjective), nonparametric (not based on restrictive parametric model assumptions), and scalable to large data sets. This article introduces LPMode, an algorithm based on a new theory for detecting multimodality of a probability density. We apply LPMode to answer important research questions arising in fields ranging from environmental science, ecology, econometrics, and analytical chemistry to astronomy and cancer genomics.
This paper formulates a penalized empirical likelihood (PEL) method for inference on the population mean when the dimension of the observations may grow faster than the sample size. Asymptotic distributions of the PEL ratio statistic are derived under different component-wise dependence structures of the observations, namely, (i) non-ergodic, (ii) long-range dependent, and (iii) short-range dependent. It follows that the limit distribution of the proposed PEL ratio statistic can vary widely depending on the correlation structure, and it is typically different from the usual chi-squared limit of the empirical likelihood ratio statistic in the fixed, finite-dimensional case. A unified subsampling-based calibration is proposed, and its validity is established in all three cases, (i)-(iii). Finite sample properties of the method are investigated through a simulation study. [Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org); http://dx.doi.org/10.1214/12-AOS1040]
There is an overwhelmingly large literature, with many algorithms already available, on "large-scale inference problems" based on different modeling techniques and cultures. Our primary goal in this article is not to add one more new methodology to the existing toolbox but instead (i) to clarify the mystery of how these different simultaneous inference methods are connected, (ii) to provide an alternative, more intuitive derivation of the formulas that leads to simpler expressions, and (iii) to develop a unified algorithm for practitioners. A detailed discussion of representation, estimation, inference, and model selection is given. Applications to a variety of real and simulated datasets show promise. We end with several future research directions.
High-dimensional $k$-sample comparison is a common task in applications. We construct a class of easy-to-implement, distribution-free tests based on new nonparametric tools and unexplored connections with spectral graph theory. The test is shown to have various desirable properties and a characteristic exploratory flavour that has practical consequences for statistical modelling. Numerical examples show that the proposed method works surprisingly well across a broad range of realistic situations.
Consider a big data multiple testing task where, due to storage and computational bottlenecks, a very large collection of p-values has been split into manageable chunks and distributed over thousands of computer nodes. This paper is concerned with the following question: how can we recover the full-data multiple testing solution by operating completely independently on individual machines in parallel, without any data exchange between nodes? This version of the problem arises naturally in a wide range of data-intensive science and industry applications, yet no methodological solution has appeared in the literature to date; we therefore feel it is necessary to undertake such an analysis. Based on the nonparametric functional statistical viewpoint of large-scale inference, started in Mukhopadhyay (2016), this paper furnishes a new computing model that brings unexpected simplicity to the design of the algorithm, which might otherwise seem daunting using the classical approach and notation.

[Figure 1: The data structure and setting of the decentralized large-scale inference problem. A massive collection of p-values is distributed across a large number of computer nodes (Machine 1, Machine 2, ..., Machine K).]

Pooling all the p-values on a single machine may be unrealistic due to the huge data volume (too expensive to store), computational bottlenecks†, and possible privacy restrictions. Driven by this practical need, interest in designing a decentralized large-scale inference engine has increased enormously in the last few years, owing to its ability to scale cost-effectively, by leveraging modern distributed storage and computing environments, as data volumes continue to increase. There is, however, apparently no explicit algorithm currently available in the literature to tackle this innocent-looking problem of breaking the multiple testing computation into many pieces, each of which can be processed completely independently on individual machines in parallel.
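To make the setting concrete, here is a toy illustration (all numbers hypothetical, not from the paper): p-values are partitioned across K "machines," and each machine can compute only local ranks, whereas rank-based procedures need ranks over the full collection.

```python
# Toy illustration (hypothetical values): partition N p-values across
# K "machines" and compare a locally computed rank with the global rank
# over the full data -- the quantity rank-based procedures actually need.
pvals = [0.004, 0.30, 0.02, 0.91, 0.07, 0.55]      # full p-value collection
K = 3
chunks = [pvals[i::K] for i in range(K)]           # round-robin partition

# Global rank (1 = smallest p-value across ALL machines):
global_rank = {p: sorted(pvals).index(p) + 1 for p in pvals}

# Local rank as seen by machine 0 alone:
local_rank = {p: sorted(chunks[0]).index(p) + 1 for p in chunks[0]}

# Machine 0 holds 0.91: locally it is rank 2 of 2, but globally rank 6
# of 6; without communication, no single machine can recover global ranks.
```

The gap between `local_rank` and `global_rank` is exactly what makes communication-free decentralized multiple testing non-trivial.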
To get a glimpse of the challenge, consider a specific multiple testing method, say the Benjamini-Hochberg (BH) FDR-controlling procedure, which starts by calculating the global rank of each p-value. Computing these global ranks from the partitioned p-values, without any communication between the machines, is a highly non-trivial problem. Difficulties of a similar caliber also arise in implementing local false discovery rate type algorithms.

† The BH (Benjamini and Hochberg, 1995) and HC (Donoho and Jin, 2004) procedures start by ordering the p-values from smallest to largest, incurring at least O(N log N) computational cost, and other methods such as the local fdr (Efron et al., 2001) are of even greater complexity, O(N^2), thereby making legacy multiple testing algorithms infeasible for such massive-scale inference problems.
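For reference, the following is a minimal sketch of the standard (centralized) BH step-up rule whose sorting step incurs the O(N log N) cost noted above; it is not the paper's decentralized algorithm, only the baseline computation that the decentralized setting must reproduce without pooling the data.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Standard BH step-up rule on the FULL p-value collection.

    Sort the p-values (the O(N log N) ranking step), find the largest k
    with p_(k) <= k * alpha / N, and reject the k smallest p-values.
    """
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)                    # global ranking step
    sorted_p = pvals[order]
    thresholds = alpha * np.arange(1, n + 1) / n # BH step-up boundaries
    below = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(n, dtype=bool)
    if below.size:
        k = below[-1]                            # largest passing index
        reject[order[:k + 1]] = True             # reject k+1 smallest
    return reject
```

Because the decision for any single p-value depends on its rank among all N p-values, naively running this rule on each machine's chunk does not, in general, reproduce the full-data rejection set.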
A new comprehensive approach to nonlinear time series analysis and modeling is developed in the present paper. We introduce novel data-specific, mid-distribution-based Legendre polynomial (LP)-like nonlinear transformations of the original time series {Y(t)} that enable us to adapt the entire existing stationary linear Gaussian time series modeling strategy and make it applicable to non-Gaussian and nonlinear processes in a robust fashion. The emphasis of the present paper is on empirical time series modeling via the algorithm LPTime. We demonstrate the effectiveness of our theoretical framework using daily S&P 500 return data from Jan 2, 1963 to Dec 31, 2009. Our proposed LPTime algorithm systematically discovers, automatically and all at once, the 'stylized facts' of the financial time series that were previously noted by many researchers one at a time.
The two key issues of modern Bayesian statistics are: (i) establishing a principled approach for distilling a statistical prior that is consistent with the given data from an initial believable scientific prior; and (ii) developing a consolidated Bayes-frequentist data analysis workflow that is more effective than either of the two separately. In this paper, we propose the idea of "Bayes via goodness-of-fit" as a framework for exploring these fundamental questions, in a way that is general enough to embrace almost all of the familiar probability models. Several examples, spanning application areas such as clinical trials, metrology, insurance, medicine, and ecology, show the unique benefit of this new point of view as a practical data science tool.
To handle the ubiquitous problem of "dependence learning," copulas are quickly becoming a pervasive tool across a wide range of data-driven disciplines encompassing neuroscience, finance, econometrics, genomics, social science, machine learning, healthcare, and many more. At the same time, despite their practical value, the empirical methods of "learning copulas from data" have been unsystematic and full of case-specific recipes. Taking inspiration from modern LP-nonparametrics, this paper presents a modest contribution to the need for a more unified and structured approach to copula modeling that is simultaneously valid for arbitrary combinations of continuous and discrete variables.