For massive data sets, subsampling algorithms are a popular way to downsize the data volume and reduce the computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this paper, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression. We first establish consistency and asymptotic normality of the estimator from a general subsampling algorithm, and then derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resulting estimator. An alternative minimization criterion is also proposed to further reduce the computational cost. Because the optimal subsampling probabilities depend on the full-data estimate, we develop a two-step algorithm to approximate the optimal subsampling procedure. This algorithm is computationally efficient and achieves a significant reduction in computing time relative to the full-data approach. Consistency and asymptotic normality of the estimator from the two-step algorithm are also established. Synthetic and real data sets are used to evaluate the practical performance of the proposed method.
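The two-step idea above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names are invented, and the second-stage probabilities are taken proportional to |y_i - p_i|·||x_i|| (one plausible criterion of this family; the paper's exact optimal weights differ), with inverse-probability weighting correcting the sampling bias in the final MLE.

```python
import numpy as np

def logistic_mle(X, y, weights=None, iters=50):
    """Weighted logistic-regression MLE via Newton-Raphson."""
    n, d = X.shape
    w = np.ones(n) if weights is None else weights
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                       # weighted score
        H = (X * (w * p * (1 - p))[:, None]).T @ X       # weighted information
        step = np.linalg.solve(H, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

def two_step_subsample(X, y, r0=200, r=800, rng=None):
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 1: uniform pilot subsample -> preliminary estimate
    idx0 = rng.choice(n, size=r0, replace=True)
    beta0 = logistic_mle(X[idx0], y[idx0])
    # Step 2: subsampling probabilities proportional to |y - p| * ||x||
    p = 1.0 / (1.0 + np.exp(-X @ beta0))
    scores = np.abs(y - p) * np.linalg.norm(X, axis=1)
    pi = scores / scores.sum()
    idx = rng.choice(n, size=r, replace=True, p=pi)
    # inverse-probability weights keep the weighted MLE consistent
    return logistic_mle(X[idx], y[idx], weights=1.0 / pi[idx])
```

Only the r0 + r sampled rows enter a Newton solve, which is where the computational saving over the full-data MLE comes from.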
We prove that the generalized Poisson distribution GP(θ, η), with η ≥ 0, is a mixture of Poisson distributions; this is a new property for a distribution that is the subject of the book by Consul (1989). Because the generalized Poisson and negative binomial distributions often fit count data similarly, we compare their probability mass functions and skewness with the first two moments fixed in order to understand their differences. The two distributions differ only slightly in many situations, but their zero-inflated versions, with the mass at zero, the mean and the variance fixed, can differ more. These probabilistic comparisons help in selecting the better-fitting distribution for modelling count data with long right tails. Through a real example of count data with a large zero fraction, we illustrate how the generalized Poisson and negative binomial distributions, as well as their zero-inflated versions, can be discriminated.
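The moment-matched comparison can be reproduced numerically. The sketch below uses the standard GP moments (mean θ/(1−η), variance θ/(1−η)³) and NB moments (mean r(1−p)/p, variance r(1−p)/p²) to solve for parameters from a given mean and variance; the particular mean/variance pair in the test is an arbitrary illustration, not one of the paper's examples.

```python
import numpy as np
from math import exp, lgamma, log

def gp_pmf(k, theta, eta):
    # generalized Poisson: P(k) = theta*(theta+k*eta)^(k-1) * exp(-theta-k*eta) / k!
    return exp(log(theta) + (k - 1) * log(theta + k * eta)
               - theta - k * eta - lgamma(k + 1))

def nb_pmf(k, r, p):
    # negative binomial: P(k) = C(k+r-1, k) * p^r * (1-p)^k
    return exp(lgamma(k + r) - lgamma(r) - lgamma(k + 1)
               + r * log(p) + k * log(1 - p))

def match_moments(mean, var):
    # GP: mean = theta/(1-eta), var = theta/(1-eta)^3  =>  eta = 1 - sqrt(mean/var)
    eta = 1.0 - (mean / var) ** 0.5
    theta = mean * (1.0 - eta)
    # NB: var = mean + mean^2/r  =>  r = mean^2/(var-mean), p = r/(r+mean)
    r = mean ** 2 / (var - mean)
    p = r / (r + mean)
    return (theta, eta), (r, p)
```

For example, with mean 2 and variance 4 the two moment-matched pmfs differ by only a few hundredths at every support point, consistent with the "slight differences" the paper reports.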
The distribution of an organism species in the environment frequently deviates from randomness because of natural cycles, the availability of food resources and the avoidance of harm. As a result, observed data can show over-dispersion, zero-inflation and even heavy tails. Models such as the negative binomial (NB), Poisson-inverse Gaussian (PIG) and zero-inflated Poisson are frequently used in applications instead of the Poisson distribution, which is usually the default model. This paper uses a three-parameter discrete distribution that unifies distributions such as the Poisson, NB, PIG, Neyman Type A and Poisson-Pascal. The three-parameter family covers a wide range of tail heaviness relative to the NB, and is thus suitable for modelling over-dispersed count data with a shorter or longer tail. Moreover, it shows some capacity for zero-inflated data. Grouped counts of coliform bacteria from Lake Erie and counts of European corn borer larvae in field corn are used to illustrate the application of the model and the associated likelihood-based inferences.
We obtain new models and results for count data time series based on binomial thinning. Count data time series may have non-stationarity from trends or covariates, so we propose an extension of stationary time series based on binomial thinning such that the univariate marginal distributions are always in the same parametric family, such as negative binomial. We propose a recursive algorithm to calculate the probability mass functions for the innovation random variable associated with binomial thinning. This simplifies numerical calculations and estimation for the classes of time series models that we consider. An application with real data is used to illustrate the models. Copyright 2006 The Authors. Journal compilation 2006 Blackwell Publishing Ltd.
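Binomial thinning itself is easy to illustrate. The sketch below simulates the simplest special case, a Poisson INAR(1) model X_t = α∘X_{t−1} + ε_t, where the innovation that preserves a Poisson(λ) marginal is known in closed form, ε_t ~ Poisson(λ(1−α)); for the negative binomial and other families treated in the paper, the innovation pmf has no such simple form, which is what the paper's recursive algorithm addresses. Function names here are invented for illustration.

```python
import numpy as np

def thin(x, alpha, rng):
    # binomial thinning: alpha ∘ x = Binomial(x, alpha),
    # i.e. each of the x counts survives independently with probability alpha
    return rng.binomial(x, alpha)

def simulate_inar1_poisson(lam=5.0, alpha=0.6, T=10000, rng=None):
    # Poisson INAR(1): X_t = alpha ∘ X_{t-1} + eps_t, eps_t ~ Poisson(lam*(1-alpha));
    # the marginal distribution stays Poisson(lam) at every t,
    # and the lag-1 autocorrelation equals alpha
    rng = np.random.default_rng(rng)
    x = np.empty(T, dtype=int)
    x[0] = rng.poisson(lam)
    for t in range(1, T):
        x[t] = thin(x[t - 1], alpha, rng) + rng.poisson(lam * (1 - alpha))
    return x
```

A long simulated path has sample mean and variance both close to λ (the Poisson marginal) and lag-1 autocorrelation close to α.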
A new method to detect different linear structures in a data set, called the Linear Grouping Algorithm (LGA), is proposed. LGA is useful for investigating potential linear patterns in data sets, that is, subsets that follow different linear relationships. LGA combines ideas from principal components, clustering methods and resampling algorithms. It can detect several different linear relations at once. Methods to determine the number of groups in the data are proposed. Diagnostic tools to investigate the results obtained from LGA are introduced. It is shown how LGA can be extended to detect groups characterized by lower dimensional hyperplanes as well. Some applications illustrate the usefulness of LGA in practice.
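The combination of principal components, clustering and resampling can be sketched as a k-hyperplanes iteration: seed each candidate hyperplane from a few resampled points, assign every observation to the nearest hyperplane by orthogonal distance, and refit each hyperplane by orthogonal regression (the smallest principal direction of its group). This is a minimal sketch under those assumptions, not the published algorithm, which adds refinements such as group-number selection and diagnostics.

```python
import numpy as np

def fit_hyperplane(pts):
    """Orthogonal regression: the hyperplane passes through the centroid,
    with normal given by the smallest-variance principal direction."""
    c = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - c, full_matrices=False)
    return c, Vt[-1]

def lga(X, k=2, iters=100, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # seed each hyperplane from d+1 resampled points (resampling idea)
    planes = [fit_hyperplane(X[rng.choice(n, size=d + 1, replace=False)])
              for _ in range(k)]
    labels = np.full(n, -1)
    for _ in range(iters):
        # assign each point to the hyperplane with smallest orthogonal distance
        dists = np.stack([np.abs((X - c) @ nrm) for c, nrm in planes], axis=1)
        new = dists.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
        # refit each hyperplane to its current group
        planes = []
        for g in range(k):
            pts = X[labels == g]
            if len(pts) < d + 1:  # re-seed empty or degenerate groups
                pts = X[rng.choice(n, size=d + 1, replace=False)]
            planes.append(fit_hyperplane(pts))
    return labels
```

Like k-means, the iteration can stop in a local optimum, so in practice it is run from several random seeds and the partition with the smallest total orthogonal distance is kept.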
An increasing number of machine learning tasks require dealing with large graph datasets, which capture rich and complex relationships among potentially billions of elements. Graph Neural Networks (GNNs) have become an effective way to address the graph learning problem: they convert the graph data into a low-dimensional space while preserving the structural and property information to the maximum extent, and then construct a neural network for training and inference. However, it is challenging to provide efficient graph storage and computation capabilities that facilitate GNN training and enable the development of new GNN algorithms. In this paper, we present a comprehensive graph neural network system, AliGraph, which consists of distributed graph storage, optimized sampling operators and a runtime that efficiently supports not only existing popular GNNs but also a series of in-house developed ones for different scenarios. The system is currently deployed at Alibaba to support a variety of business scenarios, including product recommendation and personalized search on Alibaba's e-commerce platform. In extensive experiments on a real-world dataset with 492.90 million vertices, 6.82 billion edges and rich attributes, AliGraph builds graphs an order of magnitude faster (5 minutes vs. hours reported for the state-of-the-art PowerGraph platform). During training, AliGraph runs 40%-50% faster with its novel caching strategy and achieves around a 12x speed-up with the improved runtime. In addition, our in-house developed GNN models all demonstrate statistically significant improvements in both effectiveness and efficiency (e.g., 4.12%-17.19% lifts in F1 score).