For massive data sets, subsampling algorithms are a popular way to downsize the data volume and reduce the computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this paper, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression. We first establish consistency and asymptotic normality of the estimator from a general subsampling algorithm, and then derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resulting estimator. An alternative minimization criterion is also proposed to further reduce the computational cost. Because the optimal subsampling probabilities depend on the full-data estimate, we develop a two-step algorithm to approximate the optimal subsampling procedure. This algorithm is computationally efficient and achieves a significant reduction in computing time relative to the full-data approach. Consistency and asymptotic normality of the estimator from the two-step algorithm are also established. Synthetic and real data sets are used to evaluate the practical performance of the proposed method.
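The two-step idea above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names are invented, and the second-stage probabilities are taken proportional to |y_i - p_i|·||x_i|| (one plausible criterion of this family; the paper's exact optimal weights differ), with inverse-probability weighting correcting the sampling bias in the final MLE.

```python
import numpy as np

def logistic_mle(X, y, weights=None, iters=50):
    """Weighted logistic-regression MLE via Newton-Raphson."""
    n, d = X.shape
    w = np.ones(n) if weights is None else weights
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                       # weighted score
        H = (X * (w * p * (1 - p))[:, None]).T @ X       # weighted information
        step = np.linalg.solve(H, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

def two_step_subsample(X, y, r0=200, r=800, rng=None):
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 1: uniform pilot subsample -> preliminary estimate
    idx0 = rng.choice(n, size=r0, replace=True)
    beta0 = logistic_mle(X[idx0], y[idx0])
    # Step 2: subsampling probabilities proportional to |y - p| * ||x||
    p = 1.0 / (1.0 + np.exp(-X @ beta0))
    scores = np.abs(y - p) * np.linalg.norm(X, axis=1)
    pi = scores / scores.sum()
    idx = rng.choice(n, size=r, replace=True, p=pi)
    # inverse-probability weights keep the weighted MLE consistent
    return logistic_mle(X[idx], y[idx], weights=1.0 / pi[idx])
```

Only the r0 + r sampled rows enter a Newton solve, which is where the computational saving over the full-data MLE comes from.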
We prove that the generalized Poisson distribution GP(θ, η), with η ≥ 0, is a mixture of Poisson distributions; this is a new property for a distribution that is the subject of the book by Consul (1989). Because the generalized Poisson and negative binomial distributions often fit count data similarly, we compare their probability mass functions and skewness with the first two moments fixed in order to understand their differences. The two distributions differ only slightly in many situations, but their zero-inflated versions, with the mass at zero, the mean and the variance fixed, can differ more. These probabilistic comparisons help in selecting the better-fitting distribution for modelling count data with long right tails. Through a real example of count data with a large zero fraction, we illustrate how the generalized Poisson and negative binomial distributions, as well as their zero-inflated versions, can be discriminated.
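The moment-matched comparison can be reproduced numerically. The sketch below uses the standard GP moments (mean θ/(1−η), variance θ/(1−η)³) and NB moments (mean r(1−p)/p, variance r(1−p)/p²) to solve for parameters from a given mean and variance; the particular mean/variance pair in the test is an arbitrary illustration, not one of the paper's examples.

```python
import numpy as np
from math import exp, lgamma, log

def gp_pmf(k, theta, eta):
    # generalized Poisson: P(k) = theta*(theta+k*eta)^(k-1) * exp(-theta-k*eta) / k!
    return exp(log(theta) + (k - 1) * log(theta + k * eta)
               - theta - k * eta - lgamma(k + 1))

def nb_pmf(k, r, p):
    # negative binomial: P(k) = C(k+r-1, k) * p^r * (1-p)^k
    return exp(lgamma(k + r) - lgamma(r) - lgamma(k + 1)
               + r * log(p) + k * log(1 - p))

def match_moments(mean, var):
    # GP: mean = theta/(1-eta), var = theta/(1-eta)^3  =>  eta = 1 - sqrt(mean/var)
    eta = 1.0 - (mean / var) ** 0.5
    theta = mean * (1.0 - eta)
    # NB: var = mean + mean^2/r  =>  r = mean^2/(var-mean), p = r/(r+mean)
    r = mean ** 2 / (var - mean)
    p = r / (r + mean)
    return (theta, eta), (r, p)
```

For example, with mean 2 and variance 4 the two moment-matched pmfs differ by only a few hundredths at every support point, consistent with the "slight differences" the paper reports.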
The distribution of an organism species in the environment frequently deviates from randomness because of natural cycles, the availability of food resources and the avoidance of harm. As a result, observed data can show over-dispersion, zero-inflation and even heavy tails. Models such as the negative binomial (NB), Poisson-inverse Gaussian (PIG) and zero-inflated Poisson are frequently used in applications instead of the Poisson distribution, which is usually the default model. This paper uses a three-parameter discrete distribution that unifies distributions such as the Poisson, NB, PIG, Neyman Type A and Poisson-Pascal. The three-parameter family covers a wide range of tail heaviness relative to the NB, and is thus suitable for modelling over-dispersed count data with a shorter or longer tail. Moreover, it shows some capacity for zero-inflated data. Grouped counts of coliform bacteria from Lake Erie and counts of European corn borer larvae in field corn are used to illustrate the application of the model and the associated likelihood-based inferences.
We obtain new models and results for count data time series based on binomial thinning. Count data time series may have non-stationarity from trends or covariates, so we propose an extension of stationary time series based on binomial thinning such that the univariate marginal distributions are always in the same parametric family, such as negative binomial. We propose a recursive algorithm to calculate the probability mass functions for the innovation random variable associated with binomial thinning. This simplifies numerical calculations and estimation for the classes of time series models that we consider. An application with real data is used to illustrate the models. Copyright 2006 The Authors. Journal compilation 2006 Blackwell Publishing Ltd.
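Binomial thinning itself is easy to illustrate. The sketch below simulates the simplest special case, a Poisson INAR(1) model X_t = α∘X_{t−1} + ε_t, where the innovation that preserves a Poisson(λ) marginal is known in closed form, ε_t ~ Poisson(λ(1−α)); for the negative binomial and other families treated in the paper, the innovation pmf has no such simple form, which is what the paper's recursive algorithm addresses. Function names here are invented for illustration.

```python
import numpy as np

def thin(x, alpha, rng):
    # binomial thinning: alpha ∘ x = Binomial(x, alpha),
    # i.e. each of the x counts survives independently with probability alpha
    return rng.binomial(x, alpha)

def simulate_inar1_poisson(lam=5.0, alpha=0.6, T=10000, rng=None):
    # Poisson INAR(1): X_t = alpha ∘ X_{t-1} + eps_t, eps_t ~ Poisson(lam*(1-alpha));
    # the marginal distribution stays Poisson(lam) at every t,
    # and the lag-1 autocorrelation equals alpha
    rng = np.random.default_rng(rng)
    x = np.empty(T, dtype=int)
    x[0] = rng.poisson(lam)
    for t in range(1, T):
        x[t] = thin(x[t - 1], alpha, rng) + rng.poisson(lam * (1 - alpha))
    return x
```

A long simulated path has sample mean and variance both close to λ (the Poisson marginal) and lag-1 autocorrelation close to α.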
A new method to detect different linear structures in a data set, called the Linear Grouping Algorithm (LGA), is proposed. LGA is useful for investigating potential linear patterns in data sets, that is, subsets that follow different linear relationships. LGA combines ideas from principal components, clustering methods and resampling algorithms. It can detect several different linear relations at once. Methods to determine the number of groups in the data are proposed. Diagnostic tools to investigate the results obtained from LGA are introduced. It is shown how LGA can be extended to detect groups characterized by lower dimensional hyperplanes as well. Some applications illustrate the usefulness of LGA in practice.
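The combination of principal components, clustering and resampling can be sketched as a k-hyperplanes iteration: seed each candidate hyperplane from a few resampled points, assign every observation to the nearest hyperplane by orthogonal distance, and refit each hyperplane by orthogonal regression (the smallest principal direction of its group). This is a minimal sketch under those assumptions, not the published algorithm, which adds refinements such as group-number selection and diagnostics.

```python
import numpy as np

def fit_hyperplane(pts):
    """Orthogonal regression: the hyperplane passes through the centroid,
    with normal given by the smallest-variance principal direction."""
    c = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - c, full_matrices=False)
    return c, Vt[-1]

def lga(X, k=2, iters=100, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # seed each hyperplane from d+1 resampled points (resampling idea)
    planes = [fit_hyperplane(X[rng.choice(n, size=d + 1, replace=False)])
              for _ in range(k)]
    labels = np.full(n, -1)
    for _ in range(iters):
        # assign each point to the hyperplane with smallest orthogonal distance
        dists = np.stack([np.abs((X - c) @ nrm) for c, nrm in planes], axis=1)
        new = dists.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
        # refit each hyperplane to its current group
        planes = []
        for g in range(k):
            pts = X[labels == g]
            if len(pts) < d + 1:  # re-seed empty or degenerate groups
                pts = X[rng.choice(n, size=d + 1, replace=False)]
            planes.append(fit_hyperplane(pts))
    return labels
```

Like k-means, the iteration can stop in a local optimum, so in practice it is run from several random seeds and the partition with the smallest total orthogonal distance is kept.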
An increasing number of machine learning tasks require dealing with large graph datasets, which capture rich and complex relationships among potentially billions of elements. Graph Neural Networks (GNNs) have become an effective way to address the graph learning problem: they convert the graph data into a low-dimensional space while preserving the structural and property information to the maximum extent, and then construct a neural network for training and inference. However, it is challenging to provide efficient graph storage and computation capabilities that facilitate GNN training and enable the development of new GNN algorithms. In this paper, we present a comprehensive graph neural network system, AliGraph, which consists of distributed graph storage, optimized sampling operators and a runtime that efficiently supports not only existing popular GNNs but also a series of in-house developed ones for different scenarios. The system is currently deployed at Alibaba to support a variety of business scenarios, including product recommendation and personalized search on Alibaba's e-commerce platform. In extensive experiments on a real-world dataset with 492.90 million vertices, 6.82 billion edges and rich attributes, AliGraph builds graphs an order of magnitude faster (5 minutes vs. hours reported for the state-of-the-art PowerGraph platform). During training, AliGraph runs 40%-50% faster with its novel caching strategy and achieves around a 12x speed-up with the improved runtime. In addition, our in-house developed GNN models all demonstrate statistically significant improvements in both effectiveness and efficiency (e.g., 4.12%-17.19% lifts in F1 score).