Consider the problem of estimating the Shannon entropy of a distribution over k elements from n independent samples. We show that the minimax mean-square error is within universal multiplicative constant factors of (k/(n log k))^2 + (log^2 k)/n if n exceeds a constant factor of k/log k; otherwise there exists no consistent estimator. This refines the recent result of Valiant-Valiant [VV11a] that the minimal sample size for consistent entropy estimation scales according to Θ(k/log k). The apparatus of best polynomial approximation plays a key role in both the construction of optimal estimators and, via a duality argument, the minimax lower bound.
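The minimax-optimal polynomial-approximation estimator is involved; as a point of reference, here is a minimal Python sketch of the naive plug-in estimator (whose bias the polynomial machinery is designed to reduce), together with the classical Miller-Madow bias correction. This is a baseline for illustration, not the estimator constructed in the paper.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Plug-in (MLE) entropy estimate in nats: H(p_hat) for the empirical p_hat."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def miller_madow_entropy(samples):
    """Plug-in estimate with the first-order Miller-Madow bias correction,
    which adds (observed support size - 1) / (2n)."""
    n = len(samples)
    k_obs = len(set(samples))
    return plugin_entropy(samples) + (k_obs - 1) / (2 * n)

samples = ['a', 'a', 'b', 'b']
print(plugin_entropy(samples))       # log 2, the entropy of the empirical distribution
```

The plug-in estimator is biased downward, and in the regime n ≍ k/log k this bias dominates; that is the gap the polynomial-approximation construction closes.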
We consider the problem of estimating the support size of a discrete distribution whose minimum non-zero mass is at least 1/k. Under the independent sampling model, we show that the sample complexity, i.e., the minimal sample size to achieve an additive error of εk with probability at least 0.1, is within universal constant factors of (k/log k) log^2(1/ε), which improves the state-of-the-art result of k/(ε^2 log k) in [VV13]. A similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties, which can be evaluated in O(n + log^2 k) time and attains the sample complexity within a factor of six asymptotically. The superiority of the proposed estimator in terms of accuracy, computational efficiency, and scalability is demonstrated on a variety of synthetic and real datasets.
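A linear estimator of this kind acts on the fingerprint of the sample (the number of symbols seen exactly j times). A minimal sketch of that structure, with a placeholder coefficient function `g` standing in for the paper's Chebyshev-derived coefficients:

```python
from collections import Counter

def fingerprint(samples):
    """Fingerprint (profile) Phi, where Phi[j] = number of distinct
    symbols appearing exactly j times in the sample."""
    freq = Counter(samples)         # symbol -> count
    return dict(Counter(freq.values()))

def linear_estimate(samples, g):
    """Generic linear estimator: sum over j of g(j) * Phi[j].
    g(j) = 1 recovers the naive count of observed distinct symbols;
    the paper instead derives g from Chebyshev polynomial coefficients."""
    return sum(g(j) * mult for j, mult in fingerprint(samples).items())

samples = ['a', 'b', 'b', 'c', 'c', 'c']
print(fingerprint(samples))                    # {1: 1, 2: 1, 3: 1}
print(linear_estimate(samples, lambda j: 1))   # 3 distinct symbols observed
```

Since the fingerprint is computable in one pass and has at most O(sqrt(n)) distinct frequencies, estimators of this form are cheap to evaluate, consistent with the O(n + log^2 k) running time quoted above.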
Big data analysis has found applications in many industries due to its ability to turn huge amounts of data into insights for informed business and operational decisions. Advanced data mining techniques have been applied in many sectors of supply chains in the food industry. However, previous work has mainly focused on the analysis of instrument‐generated data such as those from hyperspectral imaging, spectroscopy, and biometric receptors. The importance of digital text data in food and nutrition has only recently gained attention due to advancements in big data analytics. The purpose of this review is to provide an overview of the data sources, computational methods, and applications of text data in the food industry. Text mining techniques such as word‐level analysis (e.g., frequency analysis), word association analysis (e.g., network analysis), and advanced techniques (e.g., text classification, text clustering, topic modeling, information retrieval, and sentiment analysis) will be discussed. Applications of text data analysis will be illustrated with respect to food safety and food fraud surveillance, dietary pattern characterization, consumer‐opinion mining, new‐product development, food knowledge discovery, food supply‐chain management, and online food services. The goal is to provide insights for intelligent decision‐making to improve food production, food safety, and human nutrition.
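Of the techniques listed, word-level frequency analysis is the simplest; a minimal sketch on a short, made-up snippet of review text (the text and the tokenizer are illustrative assumptions, not from the review):

```python
import re
from collections import Counter

def word_frequencies(text, top=5):
    """Word-level frequency analysis: lowercase, tokenize on letters, count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top)

reviews = "The soup was salty. Too salty, and the salty broth overwhelmed."
print(word_frequencies(reviews, top=2))   # [('salty', 3), ('the', 2)]
```

In practice a stop-word list would remove function words like "the" before counting, so that domain terms dominate the frequency table.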
Clos-based networks, including Fat-tree and VL2, are being built in data centers, but existing per-flow routing causes low network utilization and a long latency tail. In this paper, by studying the structural properties of Fat-tree and VL2, we propose a per-packet round-robin routing algorithm called Digit-Reversal Bouncing (DRB). DRB achieves perfect packet interleaving. Our analysis and simulations show that, compared with randomized load-balancing algorithms, DRB results in smaller, bounded queues even when the traffic load approaches 100%, and it uses a smaller re-sequencing buffer to absorb out-of-order packet arrivals. Our implementation demonstrates that the design can be readily built with commodity switches. Experiments on our testbed, a Fat-tree with 54 servers, confirm our analysis and simulations, and further show that the design handles network failures within 1-2 seconds and degrades gracefully in performance.
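The digit-reversal idea underlying DRB's path selection can be sketched as follows. This is an illustrative reimplementation of the digit-reversal permutation only (assuming radix-2 uplinks and 3 digit positions, i.e., 8 paths), not the paper's full routing logic:

```python
def digit_reverse(i, radix, num_digits):
    """Reverse the base-`radix` digits of i, padded to num_digits digits.
    E.g., with radix=2, num_digits=3: 1 (001) -> 4 (100)."""
    rev = 0
    for _ in range(num_digits):
        i, d = divmod(i, radix)
        rev = rev * radix + d
    return rev

# A sender stepping its per-packet counter through digit-reversed indices
# spreads consecutive packets maximally far apart across the radix^num_digits
# candidate paths, instead of cycling through adjacent paths in order.
paths = [digit_reverse(i, radix=2, num_digits=3) for i in range(8)]
print(paths)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Because consecutive packets of a flow take widely separated paths, queues stay small and bounded, which is the interleaving property the analysis above relies on.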
We consider the distinct elements problem, where the goal is to estimate the number of distinct colors in an urn containing k balls based on n samples drawn with replacement. Based on discrete polynomial approximation and interpolation, we propose an estimator with an additive error guarantee that achieves the optimal sample complexity within O(log log k) factors, and in fact within constant factors for most cases. The estimator can be computed in O(n) time. The result also applies to sampling without replacement provided the sample size is a vanishing fraction of the urn size. One of the key auxiliary results is a sharp bound on the minimum singular value of a real rectangular Vandermonde matrix, which might be of independent interest. Mathematics Subject Classification (2010): 62G05; 41A05, 41A10, 62C20, 62D05.
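The Vandermonde quantity in the auxiliary result can be probed numerically; a minimal sketch using NumPy's SVD (a numerical check, not the paper's analytic bound):

```python
import numpy as np

def min_singular_vandermonde(nodes, m):
    """Smallest singular value of the m x len(nodes) real rectangular
    Vandermonde matrix V[i, j] = nodes[j] ** i, for i = 0 .. m-1."""
    # np.vander with increasing=True has rows (x^0, ..., x^(m-1)); transpose
    # so that rows index the power and columns index the node.
    V = np.vander(np.asarray(nodes, dtype=float), N=m, increasing=True).T
    return np.linalg.svd(V, compute_uv=False)[-1]

# Distinct nodes with m >= number of nodes give a full-column-rank matrix,
# hence a strictly positive minimum singular value.
print(min_singular_vandermonde([0.0, 0.5, 1.0], m=3))
```

How fast this minimum singular value can decay as the nodes cluster and m grows is exactly what the paper's sharp bound controls.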